aaanalysis.filter_seq

class aaanalysis.filter_seq(df_seq=None, method='cd-hit', similarity_threshold=0.9, word_size=None, global_identity=True, coverage_long=None, coverage_short=None, cluster_order=None, n_jobs=1, verbose=False)[source]

Bases:

Redundancy reduction of sequences using clustering-based algorithms.

This functions performs redundancy reduction of sequences by clustering and selecting representative sequences using the CD-HIT ([Li06]) or MMseqs2 ([Steinegger17]) algorithms locally. It allows for adjustable filtering strictness:

  • Strict filtering results in smaller, more homogeneous clusters, suitable when high sequence similarity is required.

  • Non-strict filtering creates larger, more diverse clusters, enhancing sequence representation.

CD-HIT and MMseq2 are standalone software tools, each requiring separate installation. CD-Hit is more resource-efficient and easier to install, while MMseq2 is a larger multi-purpose tool. Pairwise sequence similarities for the MMseq2 clustering results were computed using the Biopython Bio.Align.PairwiseAligner class.

Added in version 1.0.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n>=1)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Sequence length must be at least 11 amino acids and the sequence should not contain any gaps (‘-‘).

  • method ({'cd-hit', 'mmseqs'}, default='cd-hit') –

    Specifies the clustering algorithm to use:

    • cd-hit: Efficiently clusters sequences, ideal for small- to medium-sized datasets.

    • mmseqs: Advanced algorithm designed for large-scale sequence analysis, offering high accuracy.

  • similarity_threshold (float, default=0.9) – Defines the minimum sequence identity [0.4-1.0] for clustering. Higher values increase strictness.

  • word_size (int, optional) – The size of the ‘word’ (in CD-HIT, [2-5]) or ‘k-mer’ (in MMseqs, [5-7]) used for the initial screening step in clustering. Effect on strictness is dataset-dependent. If None, optimized based on similarity_threshold (CD-Hit).

  • global_identity (bool, default=True) – Whether to use global (True) or local (False) sequence identity for ‘cd-hit’ clustering. Global is stricter. 80%-coverage is used for local ‘cd-hit’ clustering if not specified. MMseq2 uses only local alignments.

  • coverage_long (float, optional) – Minimum percentage [0.1-1.0] of the longer sequence that must be included in the alignment. Higher values increase strictness.

  • coverage_short (float, optional) – Minimum percentage [0.1-1.0] of the shorter sequence that must be included in the alignment. Higher values increase strictness.

  • cluster_order ({None, 'size', 'input'}, default=None) –

    Defines the ordering of sequences in the output DataFrame:

    • None: preserve the default order returned by the clustering algorithm.

    • 'size': group clusters and sort them by decreasing cluster size.

    • 'input': reorder sequences to match the original input order.

    Use 'input' when clustering results should remain aligned with downstream analyses based on the original sequence order.

  • n_jobs (int, None, or -1, default=1) – Number of CPU cores used for multiprocessing. If -1 or None, the number is set to all available cores.

  • verbose (bool, default=False) – If True, verbose outputs are enabled.

Returns:

df_clust – A DataFrame with clustering results.

Return type:

pd.DataFrame

Notes

  • CD-HIT and MMseqs2 use different approaches for clustering sequences:

    • CD-HIT sorts sequences by length and clusters them using global or local alignments against the longest sequence.

    • MMseqs2 employs an index-based approach and optimized algorithms for faster and more sensitive data handling.

  • Parameter Comparison:

Parameter

CD-HIT

MMseqs2

similarity_threshold

-c (sequence identity)

–min-seq-id (minimum sequence id)

word_size

-n (word length)

-k (k-mer size, auto-optimized)

coverage_Long

-aL (coverage of longer seq)

–cov-mode 0 -c (bidirectional)

coverage_short

-aS (coverage of shorter seq)

–cov-mode 1 -c (target coverage)

See also

Warning

  • This function requires biopython, which is automatically installed via pip install aaanalysis[pro].

  • CD-HIT and MMseq2 must be installed separately.

  • CD-HIT is not available for Windows.

Examples

To demonstrate the filter_seq() function, we load the SEQ_CAPSID example dataset (see [Breimann24a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=1000)

The filter_seq() function is a Python wrapper for two different sequence clustering and filtering algorithms, which have to be installed independently of the AAanaylsis package. Select them by setting the method parameter to cd-hit (default) or mmseqs:

# Filtering using CD-HIT (default)
df_clust = aa.filter_seq(df_seq=df_seq)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")
aa.display_df(df_clust, n_rows=-5, show_shape=True)

# Filtering using MMSeqs
df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs")
n_clust = df_clust["cluster"].nunique()
print(f"Number of MMSeqs2 clusters: {n_clust}")
Number of CD-HIT clusters: 2000
DataFrame shape: (2000, 4)
  entry cluster identity_with_rep is_representative
1996 CAPSID_4517 1995 100.000000 1
1997 CAPSID_4516 1996 100.000000 1
1998 CAPSID_4300 1997 100.000000 1
1999 CAPSID_4108 1998 100.000000 1
2000 CAPSID_4984 1999 100.000000 1
Number of MMSeqs2 clusters: 1998

You can obtain a redundancy-reduced set of protein sequences by selecting the representative sequence of each cluster:

# Select redundancy-reduced sequences
df_selected = df_clust[df_clust["is_representative"] == 1]
aa.display_df(df_selected, n_rows=-5, show_shape=True)
DataFrame shape: (1998, 4)
  entry cluster identity_with_rep is_representative
1996 CAPSID_4936 1993 100.000000 1
1997 CAPSID_4968 1994 100.000000 1
1998 CAPSID_5002 1995 100.000000 1
1999 CAPSID_5037 1996 100.000000 1
2000 CAPSID_5069 1997 100.000000 1

To reduce the number of clusters, you can decrease the sequence similarity_threshold (default=0.9):

# Decrease number of clusters by using lower sequence similarity
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

# Filtering with MMSeqs
df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs", similarity_threshold=0.5)
n_clust = df_clust["cluster"].nunique()
print(f"Number of MMSeqs2 clusters: {n_clust}")
Number of CD-HIT clusters: 1584
Number of MMSeqs2 clusters: 1603

Adjust the length of the subsequence (called ‘word’ or ‘k-mers’) using the word_size parameter, which is optimized by default depending on the similarity threshold:

df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, word_size=2)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")
Number of CD-HIT clusters: 1584

For cd-hit, you can change to local (less strict) sequence identity by setting global_identity=False:

# Clustering with local sequence identity
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, global_identity=False)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")
Number of CD-HIT clusters: 1627

The minimum coverage of the longer and shorter sequence can be adjusted using the coverage_long and coverage_short parameters:

# Clustering with the highest sequence coverage
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=1, coverage_short=1)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters (high coverage): {n_clust}")

# Clustering with the lowest sequence coverage
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=0.1, coverage_short=0.1)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters (low coverage): {n_clust}")
Number of CD-HIT clusters (high coverage): 1930
Number of CD-HIT clusters (low coverage): 1596

To control the ordering of sequences, use cluster_order (default=None):

# - None: keep the algorithm's default order
df_clust = aa.filter_seq(df_seq=df_seq, cluster_order=None)
aa.display_df(df_clust, n_rows=-5)
# - "size": sort clusters by decreasing size
df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="size")
aa.display_df(df_clust, n_rows=-5)
# - "input": restore original input order
df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="input")
aa.display_df(df_clust, n_rows=-5)
  entry cluster identity_with_rep is_representative
1996 CAPSID_4517 1995 100.000000 1
1997 CAPSID_4516 1996 100.000000 1
1998 CAPSID_4300 1997 100.000000 1
1999 CAPSID_4108 1998 100.000000 1
2000 CAPSID_4984 1999 100.000000 1
  entry cluster identity_with_rep is_representative
1996 CAPSID_444 1995 100.000000 1
1997 CAPSID_4821 1996 100.000000 1
1998 CAPSID_530 1997 100.000000 1
1999 CAPSID_986 1998 100.000000 1
2000 CAPSID_4492 1999 100.000000 1
  entry cluster identity_with_rep is_representative
1996 CAPSID_5081 1975 100.000000 1
1997 CAPSID_5082 1445 100.000000 1
1998 CAPSID_5083 1148 100.000000 1
1999 CAPSID_5084 1733 100.000000 1
2000 CAPSID_5085 1802 100.000000 1

Multiprocessing can be enabled by using the n_jobs parameter, which is set to the maximum if n_jobs=None. However, this is only recommend for large datasets:

import time

# Run without multiprocessing
time_start = time.time()
df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=1)
time_no_mp = round(time.time() - time_start, 2)
print(f"Time without multiprocessing: {time_no_mp} seconds")

# Run with multiprocessing
time_start = time.time()
df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=3)
time_mp = round(time.time() - time_start, 2)
print(f"Time with multiprocessing. {time_mp} seconds")
Time without multiprocessing: 0.09 seconds
Time with multiprocessing. 0.13 seconds

Set verbose=True to show the direct messages of the algorithms during processing and in case of errors. This can be very detailed for MMSeqs2 and is therefore not demonstrated here.