aaanalysis.filter_seq

class aaanalysis.filter_seq(df_seq=None, method='cd-hit', similarity_threshold=0.9, word_size=None, global_identity=True, coverage_long=None, coverage_short=None, cluster_order=None, n_jobs=1, verbose=False)[source]

Bases:

Redundancy reduction of sequences using clustering-based algorithms.

This functions performs redundancy reduction of sequences by clustering and selecting representative sequences using the CD-HIT ([Li06]) or MMseqs2 ([Steinegger17]) algorithms locally. It allows for adjustable filtering strictness:

Strict filtering results in smaller, more homogeneous clusters, suitable when high sequence similarity is required.
Non-strict filtering creates larger, more diverse clusters, enhancing sequence representation.

CD-HIT and MMseq2 are standalone software tools, each requiring separate installation. CD-Hit is more resource-efficient and easier to install, while MMseq2 is a larger multi-purpose tool. Pairwise sequence similarities for the MMseq2 clustering results were computed using the Biopython Bio.Align.PairwiseAligner class.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n>=1)) – DataFrame containing an entry and ‘’sequence’’ column for unique identifiers and sequences. Sequence length must be at least 11 amino acids and the sequence should not contain any gaps (‘-‘).
method ({'cd-hit', 'mmseqs'}, default='cd-hit') –
Specifies the clustering algorithm to use:
- cd-hit: Efficiently clusters sequences, ideal for small- to medium-sized datasets.
- mmseqs: Advanced algorithm designed for large-scale sequence analysis, offering high accuracy.
similarity_threshold (float, default=0.9) – Defines the minimum sequence identity [0.4-1.0] for clustering. Higher values increase strictness.
word_size (int, optional) – The size of the ‘word’ (in CD-HIT, [2-5]) or ‘k-mer’ (in MMseqs, [5-7]) used for the initial screening step in clustering. Effect on strictness is dataset-dependent. If None, optimized based on similarity_threshold (CD-Hit).
global_identity (bool, default=True) – Whether to use global (True) or local (False) sequence identity for ‘cd-hit’ clustering. Global is stricter. 80%-coverage is used for local ‘cd-hit’ clustering if not specified. MMseq2 uses only local alignments.
coverage_long (float, optional) – Minimum percentage [0.1-1.0] of the longer sequence that must be included in the alignment. Higher values increase strictness.
coverage_short (float, optional) – Minimum percentage [0.1-1.0] of the shorter sequence that must be included in the alignment. Higher values increase strictness.
cluster_order ({None, 'size', 'input'}, default=None) –
Defines the ordering of sequences in the output DataFrame:
- None: preserve the default order returned by the clustering algorithm.
- 'size': group clusters and sort them by decreasing cluster size.
- 'input': reorder sequences to match the original input order.
Use 'input' when clustering results should remain aligned with downstream analyses based on the original sequence order.
n_jobs (int, None, or -1, default=1) – Number of CPU cores used for multiprocessing. If -1 or None, the number is set to all available cores.
verbose (bool, default=False) – If True, verbose outputs are enabled.

Returns:

df_clust – A DataFrame with clustering results.

Return type:

pd.DataFrame

Notes

CD-HIT and MMseqs2 use different approaches for clustering sequences:
- CD-HIT sorts sequences by length and clusters them using global or local alignments against the longest sequence.
- MMseqs2 employs an index-based approach and optimized algorithms for faster and more sensitive data handling.
Parameter Comparison:

Parameter	CD-HIT	MMseqs2
similarity_threshold	-c (sequence identity)	–min-seq-id (minimum sequence id)
word_size	-n (word length)	-k (k-mer size, auto-optimized)
coverage_Long	-aL (coverage of longer seq)	–cov-mode 0 -c (bidirectional)
coverage_short	-aS (coverage of shorter seq)	–cov-mode 1 -c (target coverage)

See also

CD-HIT Documentation.
MMseqs2 GitHub ReadMe and GitHub Wiki.
Comparison of CD-Hit and MMseqs2 parameters under Frequently Asked Questions.

Warning

This function requires biopython, which is automatically installed via pip install aaanalysis[pro].
CD-HIT and MMseq2 must be installed separately.
CD-HIT is not available for Windows.

Examples

To demonstrate the filter_seq() function, we load the SEQ_CAPSID example dataset (see [Breimann24a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=1000)

The filter_seq() function is a Python wrapper for two different sequence clustering and filtering algorithms, which have to be installed independently of the AAanaylsis package. Select them by setting the method parameter to cd-hit (default) or mmseqs:

# Filtering using CD-HIT (default)
df_clust = aa.filter_seq(df_seq=df_seq)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")
aa.display_df(df_clust, n_rows=-5, show_shape=True)

# Filtering using MMSeqs
df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs")
n_clust = df_clust["cluster"].nunique()
print(f"Number of MMSeqs2 clusters: {n_clust}")

Number of CD-HIT clusters: 2000
DataFrame shape: (2000, 4)

	entry	cluster	identity_with_rep	is_representative
1996	CAPSID_4517	1995	100.000000	1
1997	CAPSID_4516	1996	100.000000	1
1998	CAPSID_4300	1997	100.000000	1
1999	CAPSID_4108	1998	100.000000	1
2000	CAPSID_4984	1999	100.000000	1

Number of MMSeqs2 clusters: 1998

You can obtain a redundancy-reduced set of protein sequences by selecting the representative sequence of each cluster:

# Select redundancy-reduced sequences
df_selected = df_clust[df_clust["is_representative"] == 1]
aa.display_df(df_selected, n_rows=-5, show_shape=True)

DataFrame shape: (1998, 4)

	entry	cluster	identity_with_rep	is_representative
1996	CAPSID_4936	1993	100.000000	1
1997	CAPSID_4968	1994	100.000000	1
1998	CAPSID_5002	1995	100.000000	1
1999	CAPSID_5037	1996	100.000000	1
2000	CAPSID_5069	1997	100.000000	1

To reduce the number of clusters, you can decrease the sequence similarity_threshold (default=0.9):

# Decrease number of clusters by using lower sequence similarity
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

# Filtering with MMSeqs
df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs", similarity_threshold=0.5)
n_clust = df_clust["cluster"].nunique()
print(f"Number of MMSeqs2 clusters: {n_clust}")

Number of CD-HIT clusters: 1584
Number of MMSeqs2 clusters: 1603

Adjust the length of the subsequence (called ‘word’ or ‘k-mers’) using the word_size parameter, which is optimized by default depending on the similarity threshold:

df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, word_size=2)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

Number of CD-HIT clusters: 1584

For cd-hit, you can change to local (less strict) sequence identity by setting global_identity=False:

# Clustering with local sequence identity
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, global_identity=False)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

Number of CD-HIT clusters: 1627

The minimum coverage of the longer and shorter sequence can be adjusted using the coverage_long and coverage_short parameters:

# Clustering with the highest sequence coverage
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=1, coverage_short=1)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters (high coverage): {n_clust}")

# Clustering with the lowest sequence coverage
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=0.1, coverage_short=0.1)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters (low coverage): {n_clust}")

Number of CD-HIT clusters (high coverage): 1930
Number of CD-HIT clusters (low coverage): 1596

To control the ordering of sequences, use cluster_order (default=None):

# - None: keep the algorithm's default order
df_clust = aa.filter_seq(df_seq=df_seq, cluster_order=None)
aa.display_df(df_clust, n_rows=-5)
# - "size": sort clusters by decreasing size
df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="size")
aa.display_df(df_clust, n_rows=-5)
# - "input": restore original input order
df_clust = aa.filter_seq(df_seq=df_seq, cluster_order="input")
aa.display_df(df_clust, n_rows=-5)

	entry	cluster	identity_with_rep	is_representative
1996	CAPSID_4517	1995	100.000000	1
1997	CAPSID_4516	1996	100.000000	1
1998	CAPSID_4300	1997	100.000000	1
1999	CAPSID_4108	1998	100.000000	1
2000	CAPSID_4984	1999	100.000000	1

	entry	cluster	identity_with_rep	is_representative
1996	CAPSID_444	1995	100.000000	1
1997	CAPSID_4821	1996	100.000000	1
1998	CAPSID_530	1997	100.000000	1
1999	CAPSID_986	1998	100.000000	1
2000	CAPSID_4492	1999	100.000000	1

	entry	cluster	identity_with_rep	is_representative
1996	CAPSID_5081	1975	100.000000	1
1997	CAPSID_5082	1445	100.000000	1
1998	CAPSID_5083	1148	100.000000	1
1999	CAPSID_5084	1733	100.000000	1
2000	CAPSID_5085	1802	100.000000	1

Multiprocessing can be enabled by using the n_jobs parameter, which is set to the maximum if n_jobs=None. However, this is only recommend for large datasets:

import time

# Run without multiprocessing
time_start = time.time()
df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=1)
time_no_mp = round(time.time() - time_start, 2)
print(f"Time without multiprocessing: {time_no_mp} seconds")

# Run with multiprocessing
time_start = time.time()
df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=3)
time_mp = round(time.time() - time_start, 2)
print(f"Time with multiprocessing. {time_mp} seconds")

Time without multiprocessing: 0.09 seconds
Time with multiprocessing. 0.13 seconds

Set verbose=True to show the direct messages of the algorithms during processing and in case of errors. This can be very detailed for MMSeqs2 and is therefore not demonstrated here.