aaanalysis.filter_seq
- class aaanalysis.filter_seq(df_seq=None, method='cd-hit', similarity_threshold=0.9, word_size=None, global_identity=True, coverage_long=None, coverage_short=None, cluster_order=None, n_jobs=1, verbose=False)[source]
Bases:
Redundancy reduction of sequences using clustering-based algorithms.
This functions performs redundancy reduction of sequences by clustering and selecting representative sequences using the CD-HIT ([Li06]) or MMseqs2 ([Steinegger17]) algorithms locally. It allows for adjustable filtering strictness:
Strict filtering results in smaller, more homogeneous clusters, suitable when high sequence similarity is required.
Non-strict filtering creates larger, more diverse clusters, enhancing sequence representation.
CD-HIT and MMseq2 are standalone software tools, each requiring separate installation. CD-Hit is more resource-efficient and easier to install, while MMseq2 is a larger multi-purpose tool. Pairwise sequence similarities for the MMseq2 clustering results were computed using the Biopython
Bio.Align.PairwiseAlignerclass.Added in version 1.0.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n>=1)) – DataFrame containing an
entrycolumn with unique protein identifiers and asequencecolumn with full protein sequences. Sequence length must be at least 11 amino acids and the sequence should not contain any gaps (‘-‘).method ({'cd-hit', 'mmseqs'}, default='cd-hit') –
Specifies the clustering algorithm to use:
cd-hit: Efficiently clusters sequences, ideal for small- to medium-sized datasets.mmseqs: Advanced algorithm designed for large-scale sequence analysis, offering high accuracy.
similarity_threshold (float, default=0.9) – Defines the minimum sequence identity [0.4-1.0] for clustering. Higher values increase strictness.
word_size (int, optional) – The size of the ‘word’ (in CD-HIT, [2-5]) or ‘k-mer’ (in MMseqs, [5-7]) used for the initial screening step in clustering. Effect on strictness is dataset-dependent. If
None, optimized based onsimilarity_threshold(CD-Hit).global_identity (bool, default=True) – Whether to use global (True) or local (False) sequence identity for ‘cd-hit’ clustering. Global is stricter. 80%-coverage is used for local ‘cd-hit’ clustering if not specified. MMseq2 uses only local alignments.
coverage_long (float, optional) – Minimum percentage [0.1-1.0] of the longer sequence that must be included in the alignment. Higher values increase strictness.
coverage_short (float, optional) – Minimum percentage [0.1-1.0] of the shorter sequence that must be included in the alignment. Higher values increase strictness.
cluster_order ({None, 'size', 'input'}, default=None) –
Defines the ordering of sequences in the output DataFrame:
None: preserve the default order returned by the clustering algorithm.'size': group clusters and sort them by decreasing cluster size.'input': reorder sequences to match the original input order.
Use
'input'when clustering results should remain aligned with downstream analyses based on the original sequence order.n_jobs (int, None, or -1, default=1) – Number of CPU cores used for multiprocessing. If
-1orNone, the number is set to all available cores.verbose (bool, default=False) – If
True, verbose outputs are enabled.
- Returns:
df_clust – A DataFrame with clustering results.
- Return type:
pd.DataFrame
Notes
CD-HIT and MMseqs2 use different approaches for clustering sequences:
CD-HIT sorts sequences by length and clusters them using global or local alignments against the longest sequence.
MMseqs2 employs an index-based approach and optimized algorithms for faster and more sensitive data handling.
Parameter Comparison:
Parameter
CD-HIT
MMseqs2
similarity_threshold
-c (sequence identity)
–min-seq-id (minimum sequence id)
word_size
-n (word length)
-k (k-mer size, auto-optimized)
coverage_Long
-aL (coverage of longer seq)
–cov-mode 0 -c (bidirectional)
coverage_short
-aS (coverage of shorter seq)
–cov-mode 1 -c (target coverage)
See also
MMseqs2 GitHub ReadMe and GitHub Wiki.
Comparison of CD-Hit and MMseqs2 parameters under Frequently Asked Questions.
Warning
This function requires biopython, which is automatically installed via pip install aaanalysis[pro].
CD-HIT and MMseq2 must be installed separately.
CD-HIT is not available for Windows.
Examples
To demonstrate the
filter_seq()function, we load theSEQ_CAPSIDexample dataset (see [Breimann24a]):import aaanalysis as aa aa.options["verbose"] = False df_seq = aa.load_dataset(name="SEQ_CAPSID", n=1000)
The
filter_seq()function is a Python wrapper for two different sequence clustering and filtering algorithms, which have to be installed independently of theAAanaylsispackage. Select them by setting themethodparameter tocd-hit(default) ormmseqs:# Filtering using CD-HIT (default) df_clust = aa.filter_seq(df_seq=df_seq) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}") aa.display_df(df_clust, n_rows=-5, show_shape=True) # Filtering using MMSeqs df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs") n_clust = df_clust["cluster"].nunique() print(f"Number of MMSeqs2 clusters: {n_clust}")Number of CD-HIT clusters: 2000 DataFrame shape: (2000, 4)
entry cluster identity_with_rep is_representative 1996 CAPSID_4517 1995 100.000000 1 1997 CAPSID_4516 1996 100.000000 1 1998 CAPSID_4300 1997 100.000000 1 1999 CAPSID_4108 1998 100.000000 1 2000 CAPSID_4984 1999 100.000000 1 Number of MMSeqs2 clusters: 1998
You can obtain a redundancy-reduced set of protein sequences by selecting the representative sequence of each cluster:
# Select redundancy-reduced sequences df_selected = df_clust[df_clust["is_representative"] == 1] aa.display_df(df_selected, n_rows=-5, show_shape=True)
DataFrame shape: (1998, 4)
entry cluster identity_with_rep is_representative 1996 CAPSID_4936 1993 100.000000 1 1997 CAPSID_4968 1994 100.000000 1 1998 CAPSID_5002 1995 100.000000 1 1999 CAPSID_5037 1996 100.000000 1 2000 CAPSID_5069 1997 100.000000 1 To reduce the number of clusters, you can decrease the sequence
similarity_threshold(default=0.9):# Decrease number of clusters by using lower sequence similarity df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}") # Filtering with MMSeqs df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs", similarity_threshold=0.5) n_clust = df_clust["cluster"].nunique() print(f"Number of MMSeqs2 clusters: {n_clust}")Number of CD-HIT clusters: 1584 Number of MMSeqs2 clusters: 1603
Adjust the length of the subsequence (called ‘word’ or ‘k-mers’) using the
word_sizeparameter, which is optimized by default depending on the similarity threshold:df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, word_size=2) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}")Number of CD-HIT clusters: 1584
For
cd-hit, you can change to local (less strict) sequence identity by settingglobal_identity=False:# Clustering with local sequence identity df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, global_identity=False) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}")Number of CD-HIT clusters: 1627
The minimum coverage of the longer and shorter sequence can be adjusted using the
coverage_longandcoverage_shortparameters:# Clustering with the highest sequence coverage df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=1, coverage_short=1) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters (high coverage): {n_clust}") # Clustering with the lowest sequence coverage df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=0.1, coverage_short=0.1) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters (low coverage): {n_clust}")Number of CD-HIT clusters (high coverage): 1930 Number of CD-HIT clusters (low coverage): 1596
To control the ordering of sequences, use
cluster_order(default=None):# - None: keep the algorithm's default order df_clust = aa.filter_seq(df_seq=df_seq, cluster_order=None) aa.display_df(df_clust, n_rows=-5) # - "size": sort clusters by decreasing size df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="size") aa.display_df(df_clust, n_rows=-5) # - "input": restore original input order df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="input") aa.display_df(df_clust, n_rows=-5)
entry cluster identity_with_rep is_representative 1996 CAPSID_4517 1995 100.000000 1 1997 CAPSID_4516 1996 100.000000 1 1998 CAPSID_4300 1997 100.000000 1 1999 CAPSID_4108 1998 100.000000 1 2000 CAPSID_4984 1999 100.000000 1 entry cluster identity_with_rep is_representative 1996 CAPSID_444 1995 100.000000 1 1997 CAPSID_4821 1996 100.000000 1 1998 CAPSID_530 1997 100.000000 1 1999 CAPSID_986 1998 100.000000 1 2000 CAPSID_4492 1999 100.000000 1 entry cluster identity_with_rep is_representative 1996 CAPSID_5081 1975 100.000000 1 1997 CAPSID_5082 1445 100.000000 1 1998 CAPSID_5083 1148 100.000000 1 1999 CAPSID_5084 1733 100.000000 1 2000 CAPSID_5085 1802 100.000000 1 Multiprocessing can be enabled by using the
n_jobsparameter, which is set to the maximum ifn_jobs=None. However, this is only recommend for large datasets:import time # Run without multiprocessing time_start = time.time() df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=1) time_no_mp = round(time.time() - time_start, 2) print(f"Time without multiprocessing: {time_no_mp} seconds") # Run with multiprocessing time_start = time.time() df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=3) time_mp = round(time.time() - time_start, 2) print(f"Time with multiprocessing. {time_mp} seconds")Time without multiprocessing: 0.09 seconds Time with multiprocessing. 0.13 seconds
Set
verbose=Trueto show the direct messages of the algorithms during processing and in case of errors. This can be very detailed for MMSeqs2 and is therefore not demonstrated here.