AAclust.filter_coverage

AAclust.filter_coverage(X, scale_ids, names_ref, min_coverage=100, df_cat=None, col_name='subcategory')

Select a redundancy-reduced set of numerical scales with defined subcategory coverage.

This method reduces the number of numerical scales in the feature matrix X, while ensuring that the selected scales cover a minimum percentage (min_coverage) of subcategories.

The process involves clustering the scales in X and selecting one scale per cluster. The initial number of clusters is determined by the number of unique subcategories in names_ref. The number of clusters is increased step-wise until the overlap (coverage) between the unique elements in names_ref and the subcategories of the selected scales meets a defined threshold (min_coverage).
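The selection loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the AAclust implementation: the clustering step is replaced by a deterministic farthest-first pick of representatives, and the names pick_medoids, coverage, select_by_coverage, and the toy data are hypothetical and introduced only for this sketch.

```python
import math

def pick_medoids(X, k):
    """Pick k row indices by farthest-first traversal (stand-in for clustering)."""
    chosen = [0]  # start from the first row
    while len(chosen) < k:
        candidates = [i for i in range(len(X)) if i not in chosen]
        # Distance of each remaining row to its nearest already-chosen row
        nearest = {i: min(math.dist(X[i], X[j]) for j in chosen) for i in candidates}
        # Add the row that is farthest from all chosen rows
        chosen.append(max(candidates, key=nearest.get))
    return chosen

def coverage(names, names_ref):
    """Percentage of unique reference names found among the selected names."""
    ref = set(names_ref)
    return 100 * len(ref & set(names)) / len(ref)

def select_by_coverage(X, scale_ids, names, names_ref, min_coverage=100):
    """Increase the number of representatives until min_coverage is reached."""
    n_clusters = len(set(names_ref))  # initial number of clusters
    while True:
        idx = pick_medoids(X, min(n_clusters, len(X)))
        cov = coverage([names[i] for i in idx], names_ref)
        if cov >= min_coverage or n_clusters >= len(X):
            return [scale_ids[i] for i in idx], cov
        n_clusters += 1  # step-wise increase

# Toy data (hypothetical): 5 scales described by 2 features each
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.5, 5.0], [0.0, 5.0]]
scale_ids = ["s1", "s2", "s3", "s4", "s5"]
names = ["A", "B", "C", "C", "A"]  # subcategory of each scale
names_ref = names

selected, cov = select_by_coverage(X, scale_ids, names, names_ref)
print(selected, cov)
selected_50, cov_50 = select_by_coverage(X, scale_ids, names, names_ref, min_coverage=50)
print(selected_50, round(cov_50, 1))
```

In this toy run, three representatives miss subcategory "B", so a fourth is added before full coverage is reached. With min_coverage=50 the loop stops at the initial three.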

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_scales, n_features)) – Feature matrix. Rows correspond to scales and columns to amino acids.
scale_ids (list of str) – List of scale IDs corresponding to the rows in X.
names_ref (list of str) – List of reference names (‘subcategories’) that the selected scales should cover. Its unique elements must match the unique subcategories associated with scale_ids.
min_coverage (int, default=100) – Minimum coverage percentage of unique subcategories to be achieved by the selected clusters.
df_cat (pd.DataFrame, default=None) – DataFrame containing the categorical information for each scale. Required columns are ‘scale_id’, ‘category’, ‘subcategory’, and ‘scale_name’; the column given by col_name must be one of them.
col_name ({'category', 'subcategory', 'scale_name'}, default='subcategory') – Column name in df_cat that contains the subcategory information (alternatively, category or scale name).
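The requirement on names_ref can be checked before calling the method. The following is a minimal sketch, assuming the category information is available as a plain mapping from scale ID to subcategory; check_names_ref and the toy mapping are hypothetical and not part of aaanalysis.

```python
def check_names_ref(scale_ids, names_ref, id_to_name):
    """Compare the unique names in names_ref with the names linked to scale_ids."""
    expected = {id_to_name[s] for s in scale_ids}
    given = set(names_ref)
    missing = sorted(expected - given)  # names of scale_ids absent from names_ref
    extra = sorted(given - expected)    # names in names_ref without a matching scale
    return missing, extra

# Toy mapping (hypothetical): scale ID -> subcategory
id_to_name = {"s1": "Volume", "s2": "Buried", "s3": "Volume"}

missing, extra = check_names_ref(["s1", "s2", "s3"], ["Volume", "Charge"], id_to_name)
print(missing, extra)  # ['Buried'] ['Charge']
```

Both lists are empty when names_ref satisfies the requirement. Duplicates in names_ref do not matter, since only unique elements are compared.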

Returns:

selected_scale_ids – List of selected scale IDs that meet the minimum coverage criterion.

Return type:

list of str

See also

AAclust.fit(): The clustering function used in every round for scale selection.
AAclust.comp_coverage(): The function used to compute the subcategory coverage.

Examples

We load an example scale dataset to showcase the AAclust().filter_coverage() method:

import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 100 amino acid scales
df_scales = aa.load_scales()
scale_ids = list(df_scales.columns)[:100]
X = df_scales[scale_ids].T.values

# Create list of subcategories to be covered
df_cat = aa.load_scales(name="scales_cat")
names_ref = df_cat[df_cat["scale_id"].isin(scale_ids)]["subcategory"].tolist()

# Create AAclust model
aac = aa.AAclust()

aa.display_df(df_cat, n_rows=5)

	scale_id	category	subcategory	scale_name	scale_description
1	LINS030110	ASA/Volume	Accessible surface area (ASA)	ASA (folded coil/turn)	Total median ac...s et al., 2003)
2	LINS030113	ASA/Volume	Accessible surface area (ASA)	ASA (folded coil/turn)	% total accessi...s et al., 2003)
3	JANJ780101	ASA/Volume	Accessible surface area (ASA)	ASA (folded protein)	Average accessi...n et al., 1978)
4	JANJ780103	ASA/Volume	Accessible surface area (ASA)	ASA (folded protein)	Percentage of e...n et al., 1978)
5	LINS030104	ASA/Volume	Accessible surface area (ASA)	ASA (folded protein)	Total median ac...s et al., 2003)

Internally, the method fits the AAclust model and iteratively computes the subcategory coverage until the given threshold is reached. Provide the scales as a feature matrix (X) and as a list of IDs (scale_ids), along with a category DataFrame (df_cat) and the list of subcategories to cover (names_ref):

# Select scales based on subcategory coverage
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref)
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]

aa.display_df(df_cat_selected, show_shape=True, n_rows=5)

From 100 scales, the following 68 scales were selected:
DataFrame shape: (68, 5)

	scale_id	category	subcategory	scale_name	scale_description
7	CHOC760102	ASA/Volume	Accessible surface area (ASA)	ASA (folded proteins)	Residue accessi...(Chothia, 1976)
24	BIOV880101	ASA/Volume	Buried	Buriability	Information val...u et al., 1988)
28	ARGP820103	ASA/Volume	Buried	Buried	Membrane-buried...s et al., 1982)
30	CHOC760104	ASA/Volume	Buried	Buried	Proportion of r...(Chothia, 1976)
46	COHE430101	ASA/Volume	Partial specific volume	Partial specific volume	Partial specifi...n-Edsall, 1943)

Adjust the subcategory coverage threshold using the min_coverage (default=100) parameter:

# Select scales using a subcategory coverage of 50%
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref, min_coverage=50)
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]

aa.display_df(df_cat_selected, show_shape=True, n_rows=5)

From 100 scales, the following 41 scales were selected:
DataFrame shape: (41, 5)

	scale_id	category	subcategory	scale_name	scale_description
7	CHOC760102	ASA/Volume	Accessible surface area (ASA)	ASA (folded proteins)	Residue accessi...(Chothia, 1976)
28	ARGP820103	ASA/Volume	Buried	Buried	Membrane-buried...s et al., 1982)
46	COHE430101	ASA/Volume	Partial specific volume	Partial specific volume	Partial specifi...n-Edsall, 1943)
48	CHOC760101	ASA/Volume	Volume	Accessible surface area (ASA)	Residue accessi...(Chothia, 1976)
54	DAWD720101	ASA/Volume	Volume	Volume	Size (Dawson, 1972)

Instead of using the ‘subcategory’ coverage, you can compute the coverage of ‘category’ or ‘scale_name’ by adjusting the col_name (default=‘subcategory’) parameter:

# Create list of categories to be covered
df_cat = aa.load_scales(name="scales_cat")
names_ref = df_cat[df_cat["scale_id"].isin(scale_ids)]["category"].tolist()

# Select scales using a category coverage of 100%
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref, col_name="category")
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]

aa.display_df(df_cat_selected, show_shape=True, n_rows=5)

From 100 scales, the following 24 scales were selected:
DataFrame shape: (24, 5)

	scale_id	category	subcategory	scale_name	scale_description
28	ARGP820103	ASA/Volume	Buried	Buried	Membrane-buried...s et al., 1982)
29	CHOC760103	ASA/Volume	Buried	Buried	Proportion of r...(Chothia, 1976)
48	CHOC760101	ASA/Volume	Volume	Accessible surface area (ASA)	Residue accessi...(Chothia, 1976)
65	DAYM780101	Composition	AA composition	AA composition	Amino acid comp... et al., 1978a)
120	BULH740101	Composition	MPs (anchor)	TFE to surface	Transfer free e...l-Breese, 1974)