aaanalysis.filter_seq
- class aaanalysis.filter_seq(df_seq=None, method='cd-hit', similarity_threshold=0.9, word_size=None, global_identity=True, coverage_long=None, coverage_short=None, cluster_order=None, n_jobs=1, verbose=False)[source]
Bases:
Redundancy reduction of sequences using clustering-based algorithms.
This functions performs redundancy reduction of sequences by clustering and selecting representative sequences using the CD-HIT ([Li06]) or MMseqs2 ([Steinegger17]) algorithms locally. It allows for adjustable filtering strictness:
Strict filtering results in smaller, more homogeneous clusters, suitable when high sequence similarity is required.
Non-strict filtering creates larger, more diverse clusters, enhancing sequence representation.
CD-HIT and MMseq2 are standalone software tools, each requiring separate installation. CD-Hit is more resource-efficient and easier to install, while MMseq2 is a larger multi-purpose tool. Pairwise sequence similarities for the MMseq2 clustering results were computed using the Biopython
Bio.Align.PairwiseAlignerclass.- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n>=1)) – DataFrame containing an
entryand ‘’sequence’’ column for unique identifiers and sequences. Sequence length must be at least 11 amino acids and the sequence should not contain any gaps (‘-‘).method ({'cd-hit', 'mmseqs'}, default='cd-hit') –
Specifies the clustering algorithm to use:
cd-hit: Efficiently clusters sequences, ideal for small- to medium-sized datasets.mmseqs: Advanced algorithm designed for large-scale sequence analysis, offering high accuracy.
similarity_threshold (float, default=0.9) – Defines the minimum sequence identity [0.4-1.0] for clustering. Higher values increase strictness.
word_size (int, optional) – The size of the ‘word’ (in CD-HIT, [2-5]) or ‘k-mer’ (in MMseqs, [5-7]) used for the initial screening step in clustering. Effect on strictness is dataset-dependent. If
None, optimized based onsimilarity_threshold(CD-Hit).global_identity (bool, default=True) – Whether to use global (True) or local (False) sequence identity for ‘cd-hit’ clustering. Global is stricter. 80%-coverage is used for local ‘cd-hit’ clustering if not specified. MMseq2 uses only local alignments.
coverage_long (float, optional) – Minimum percentage [0.1-1.0] of the longer sequence that must be included in the alignment. Higher values increase strictness.
coverage_short (float, optional) – Minimum percentage [0.1-1.0] of the shorter sequence that must be included in the alignment. Higher values increase strictness.
cluster_order ({None, 'size', 'input'}, default=None) –
Defines the ordering of sequences in the output DataFrame:
None: preserve the default order returned by the clustering algorithm.'size': group clusters and sort them by decreasing cluster size.'input': reorder sequences to match the original input order.
Use
'input'when clustering results should remain aligned with downstream analyses based on the original sequence order.n_jobs (int, None, or -1, default=1) – Number of CPU cores used for multiprocessing. If
-1orNone, the number is set to all available cores.verbose (bool, default=False) – If
True, verbose outputs are enabled.
- Returns:
df_clust – A DataFrame with clustering results.
- Return type:
pd.DataFrame
Notes
CD-HIT and MMseqs2 use different approaches for clustering sequences:
CD-HIT sorts sequences by length and clusters them using global or local alignments against the longest sequence.
MMseqs2 employs an index-based approach and optimized algorithms for faster and more sensitive data handling.
Parameter Comparison:
Parameter
CD-HIT
MMseqs2
similarity_threshold
-c (sequence identity)
–min-seq-id (minimum sequence id)
word_size
-n (word length)
-k (k-mer size, auto-optimized)
coverage_Long
-aL (coverage of longer seq)
–cov-mode 0 -c (bidirectional)
coverage_short
-aS (coverage of shorter seq)
–cov-mode 1 -c (target coverage)
See also
MMseqs2 GitHub ReadMe and GitHub Wiki.
Comparison of CD-Hit and MMseqs2 parameters under Frequently Asked Questions.
Warning
This function requires biopython, which is automatically installed via pip install aaanalysis[pro].
CD-HIT and MMseq2 must be installed separately.
CD-HIT is not available for Windows.
Examples
To demonstrate the
filter_seq()function, we load theSEQ_CAPSIDexample dataset (see [Breimann24a]):import aaanalysis as aa aa.options["verbose"] = False df_seq = aa.load_dataset(name="SEQ_CAPSID", n=1000)
The
filter_seq()function is a Python wrapper for two different sequence clustering and filtering algorithms, which have to be installed independently of theAAanaylsispackage. Select them by setting themethodparameter tocd-hit(default) ormmseqs:# Filtering using CD-HIT (default) df_clust = aa.filter_seq(df_seq=df_seq) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}") aa.display_df(df_clust, n_rows=-5, show_shape=True) # Filtering using MMSeqs df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs") n_clust = df_clust["cluster"].nunique() print(f"Number of MMSeqs2 clusters: {n_clust}")Number of CD-HIT clusters: 2000 DataFrame shape: (2000, 4)
entry cluster identity_with_rep is_representative 1996 CAPSID_4517 1995 100.000000 1 1997 CAPSID_4516 1996 100.000000 1 1998 CAPSID_4300 1997 100.000000 1 1999 CAPSID_4108 1998 100.000000 1 2000 CAPSID_4984 1999 100.000000 1 Number of MMSeqs2 clusters: 1998
You can obtain a redundancy-reduced set of protein sequences by selecting the representative sequence of each cluster:
# Select redundancy-reduced sequences df_selected = df_clust[df_clust["is_representative"] == 1] aa.display_df(df_selected, n_rows=-5, show_shape=True)
DataFrame shape: (1998, 4)
entry cluster identity_with_rep is_representative 1996 CAPSID_4936 1993 100.000000 1 1997 CAPSID_4968 1994 100.000000 1 1998 CAPSID_5002 1995 100.000000 1 1999 CAPSID_5037 1996 100.000000 1 2000 CAPSID_5069 1997 100.000000 1 To reduce the number of clusters, you can decrease the sequence
similarity_threshold(default=0.9):# Decrease number of clusters by using lower sequence similarity df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}") # Filtering with MMSeqs df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs", similarity_threshold=0.5) n_clust = df_clust["cluster"].nunique() print(f"Number of MMSeqs2 clusters: {n_clust}")Number of CD-HIT clusters: 1584 Number of MMSeqs2 clusters: 1603
Adjust the length of the subsequence (called ‘word’ or ‘k-mers’) using the
word_sizeparameter, which is optimized by default depending on the similarity threshold:df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, word_size=2) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}")Number of CD-HIT clusters: 1584
For
cd-hit, you can change to local (less strict) sequence identity by settingglobal_identity=False:# Clustering with local sequence identity df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, global_identity=False) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters: {n_clust}")Number of CD-HIT clusters: 1627
The minimum coverage of the longer and shorter sequence can be adjusted using the
coverage_longandcoverage_shortparameters:# Clustering with the highest sequence coverage df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=1, coverage_short=1) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters (high coverage): {n_clust}") # Clustering with the lowest sequence coverage df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=0.1, coverage_short=0.1) n_clust = df_clust["cluster"].nunique() print(f"Number of CD-HIT clusters (low coverage): {n_clust}")Number of CD-HIT clusters (high coverage): 1930 Number of CD-HIT clusters (low coverage): 1596
To control the ordering of sequences, use
cluster_order(default=None):# - None: keep the algorithm's default order df_clust = aa.filter_seq(df_seq=df_seq, cluster_order=None) aa.display_df(df_clust, n_rows=-5) # - "size": sort clusters by decreasing size df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="size") aa.display_df(df_clust, n_rows=-5) # - "input": restore original input order df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="input") aa.display_df(df_clust, n_rows=-5)
entry cluster identity_with_rep is_representative 1996 CAPSID_4517 1995 100.000000 1 1997 CAPSID_4516 1996 100.000000 1 1998 CAPSID_4300 1997 100.000000 1 1999 CAPSID_4108 1998 100.000000 1 2000 CAPSID_4984 1999 100.000000 1 entry cluster identity_with_rep is_representative 1996 CAPSID_444 1995 100.000000 1 1997 CAPSID_4821 1996 100.000000 1 1998 CAPSID_530 1997 100.000000 1 1999 CAPSID_986 1998 100.000000 1 2000 CAPSID_4492 1999 100.000000 1 entry cluster identity_with_rep is_representative 1996 CAPSID_5081 1975 100.000000 1 1997 CAPSID_5082 1445 100.000000 1 1998 CAPSID_5083 1148 100.000000 1 1999 CAPSID_5084 1733 100.000000 1 2000 CAPSID_5085 1802 100.000000 1 Multiprocessing can be enabled by using the
n_jobsparameter, which is set to the maximum ifn_jobs=None. However, this is only recommend for large datasets:import time # Run without multiprocessing time_start = time.time() df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=1) time_no_mp = round(time.time() - time_start, 2) print(f"Time without multiprocessing: {time_no_mp} seconds") # Run with multiprocessing time_start = time.time() df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=3) time_mp = round(time.time() - time_start, 2) print(f"Time with multiprocessing. {time_mp} seconds")Time without multiprocessing: 0.09 seconds Time with multiprocessing. 0.13 seconds
Set
verbose=Trueto show the direct messages of the algorithms during processing and in case of errors. This can be very detailed for MMSeqs2 and is therefore not demonstrated here.