aaanalysis.AAclust.filter_coverage

AAclust.filter_coverage(X, scale_ids=None, names_ref=None, min_coverage=100, df_cat=None, col_name='subcategory')

Select a redundancy-reduced set of numerical scales with defined subcategory coverage.

This method reduces the number of numerical scales in the feature matrix X, while ensuring that the selected scales cover a minimum percentage (min_coverage) of subcategories.

The process involves clustering the scales in X and selecting one scale per cluster. The initial number of clusters is determined by the number of unique subcategories in names_ref. The number of clusters is increased step-wise until the overlap (coverage) between the unique elements in names_ref and the subcategories of the selected scales meets a defined threshold (min_coverage).
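The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: farthest_first is a simple stand-in for the AAclust clustering step (it picks k well-separated rows rather than one representative per cluster), labels is a hypothetical list holding the subcategory of each row of X, and the sketch also returns the reached coverage, which the real method does not.

```python
import math


def farthest_first(X, k):
    """Pick k row indices of X that are far apart (stand-in for 'one scale per cluster')."""
    chosen = [0]
    # Distance of every row to its nearest already chosen row
    dist = [math.dist(row, X[0]) for row in X]
    while len(chosen) < k:
        nxt = max(range(len(X)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(row, X[nxt])) for d, row in zip(dist, X)]
    return chosen


def filter_coverage_sketch(X, scale_ids, labels, names_ref, min_coverage=100):
    """Increase the number of selected scales until the coverage threshold is met."""
    ref = set(names_ref)
    if not ref:
        raise ValueError("names_ref must not be empty")
    # Start with as many representatives as there are unique reference names
    k = min(len(ref), len(X))
    while True:
        idx = farthest_first(X, k)
        covered = {labels[i] for i in idx} & ref
        coverage = 100 * len(covered) / len(ref)
        # Stop when the threshold is met or every scale has been selected
        if coverage >= min_coverage or k >= len(X):
            return [scale_ids[i] for i in idx], coverage
        k += 1


# Toy data (hypothetical): five scales with two values each
X = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.9], [0.0, 1.0]]
scale_ids = ["S1", "S2", "S3", "S4", "S5"]
labels = ["A", "B", "A", "A", "C"]

selected, coverage = filter_coverage_sketch(X, scale_ids, labels, names_ref=labels)
print(selected, coverage)  # ['S1', 'S3', 'S5', 'S2'] 100.0

selected_50, coverage_50 = filter_coverage_sketch(X, scale_ids, labels, names_ref=labels, min_coverage=50)
print(selected_50, round(coverage_50, 1))  # ['S1', 'S3', 'S5'] 66.7
```

With three unique labels the sketch starts at three representatives, which cover only 'A' and 'C'. A fourth representative is needed to reach full coverage, while a threshold of 50 is already satisfied by the first three.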

Parameters:
  • X (array-like, shape (n_scales, n_features)) – Feature matrix. Rows correspond to scales and columns to amino acids.

  • scale_ids (list of str) – List of scale IDs corresponding to the rows in X.

  • names_ref (list of str) – List of reference names (‘subcategories’) to be covered. Must contain the same unique elements as the subcategories associated with scale_ids.

  • min_coverage (int, default=100) – Minimum coverage percentage of unique subcategories to be achieved by the selected clusters.

  • df_cat (pd.DataFrame, optional) – DataFrame containing the categorical information for each scale. Required columns are ‘scale_id’, ‘category’, ‘subcategory’, and ‘scale_name’; coverage is computed on the column given by col_name.

  • col_name ({'category', 'subcategory', 'scale_name'}, default='subcategory') – Column name in df_cat that contains the subcategory information (alternatively, category or scale name).
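The constraints stated in the parameter list can be checked before calling the method. The sketch below is illustrative only: check_inputs and labels_by_scale (a dict mapping each scale ID to its label in the chosen col_name column) are hypothetical names, the accepted range for min_coverage is an assumption for a percentage, and the library runs its own validation, which may differ.

```python
def check_inputs(X, scale_ids, labels_by_scale, names_ref, min_coverage=100):
    """Raise ValueError if the inputs are inconsistent (illustrative checks only)."""
    # One scale ID per row of X
    if len(X) != len(scale_ids):
        raise ValueError("X must have one row per entry in scale_ids")
    # Every scale needs a label in the chosen column
    missing = [s for s in scale_ids if s not in labels_by_scale]
    if missing:
        raise ValueError(f"No label found for: {missing}")
    # names_ref must match the unique labels associated with scale_ids
    expected = {labels_by_scale[s] for s in scale_ids}
    if set(names_ref) != expected:
        raise ValueError("names_ref must contain the same unique labels as the scales")
    # Assumed valid range for a percentage threshold
    if not 0 < min_coverage <= 100:
        raise ValueError("min_coverage should be in (0, 100]")


# Toy inputs (hypothetical)
X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
scale_ids = ["S1", "S2", "S3"]
labels_by_scale = {"S1": "Buried", "S2": "Volume", "S3": "Buried"}
names_ref = ["Buried", "Volume", "Buried"]

check_inputs(X, scale_ids, labels_by_scale, names_ref)  # passes silently
```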

Returns:

selected_scale_ids – List of selected scale ids that meet the minimum coverage criteria.

Return type:

list of str

Examples

We load an example scale dataset to showcase the AAclust().filter_coverage() method:

import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 100 amino acid scales
df_scales = aa.load_scales()
scale_ids = list(df_scales.columns)[:100]
X = df_scales[scale_ids].T.values

# Create list of subcategories to be covered
df_cat = aa.load_scales(name="scales_cat")
names_ref = df_cat[df_cat["scale_id"].isin(scale_ids)]["subcategory"].tolist()

# Create AAclust model
aac = aa.AAclust()

aa.display_df(df_cat, n_rows=5)
   scale_id    category    subcategory                    scale_name              scale_description
1  LINS030110  ASA/Volume  Accessible surface area (ASA)  ASA (folded coil/turn)  Total median ac...s et al., 2003)
2  LINS030113  ASA/Volume  Accessible surface area (ASA)  ASA (folded coil/turn)  % total accessi...s et al., 2003)
3  JANJ780101  ASA/Volume  Accessible surface area (ASA)  ASA (folded protein)    Average accessi...n et al., 1978)
4  JANJ780103  ASA/Volume  Accessible surface area (ASA)  ASA (folded protein)    Percentage of e...n et al., 1978)
5  LINS030104  ASA/Volume  Accessible surface area (ASA)  ASA (folded protein)    Total median ac...s et al., 2003)

Internally, the method fits the AAclust model and iteratively computes the subcategory coverage until the given threshold is reached. Provide the scales as a feature matrix (X) and as a list of IDs (scale_ids), along with a category DataFrame (df_cat) and the list of subcategories to cover (names_ref):

# Select scales based on subcategory coverage
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref)
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]

aa.display_df(df_cat_selected, show_shape=True, n_rows=5)
From 100 scales, the following 68 scales were selected:
DataFrame shape: (68, 5)
    scale_id    category    subcategory                    scale_name               scale_description
7   CHOC760102  ASA/Volume  Accessible surface area (ASA)  ASA (folded proteins)    Residue accessi...(Chothia, 1976)
24  BIOV880101  ASA/Volume  Buried                         Buriability              Information val...u et al., 1988)
28  ARGP820103  ASA/Volume  Buried                         Buried                   Membrane-buried...s et al., 1982)
30  CHOC760104  ASA/Volume  Buried                         Buried                   Proportion of r...(Chothia, 1976)
46  COHE430101  ASA/Volume  Partial specific volume        Partial specific volume  Partial specifi...n-Edsall, 1943)

Adjust the subcategory coverage threshold using the min_coverage (default=100) parameter:

# Select scales using a subcategory coverage of 50%
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref, min_coverage=50)
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]

aa.display_df(df_cat_selected, show_shape=True, n_rows=5)
From 100 scales, the following 41 scales were selected:
DataFrame shape: (41, 5)
    scale_id    category    subcategory                    scale_name                     scale_description
7   CHOC760102  ASA/Volume  Accessible surface area (ASA)  ASA (folded proteins)          Residue accessi...(Chothia, 1976)
28  ARGP820103  ASA/Volume  Buried                         Buried                         Membrane-buried...s et al., 1982)
46  COHE430101  ASA/Volume  Partial specific volume        Partial specific volume        Partial specifi...n-Edsall, 1943)
48  CHOC760101  ASA/Volume  Volume                         Accessible surface area (ASA)  Residue accessi...(Chothia, 1976)
54  DAWD720101  ASA/Volume  Volume                         Volume                         Size (Dawson, 1972)
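To see what coverage a given selection reaches, the percentage can be recomputed from two lists of labels. The helper below is a minimal sketch, assuming coverage is the share of unique names_ref elements that also occur among the labels of the selected scales, as the method description suggests. coverage_percent is a hypothetical name and not part of aaanalysis, and the label lists are toy values.

```python
def coverage_percent(selected_labels, names_ref):
    """Percentage of unique reference names found among the selected labels."""
    ref = set(names_ref)
    if not ref:
        raise ValueError("names_ref must not be empty")
    return 100 * len(ref & set(selected_labels)) / len(ref)


# Toy labels (hypothetical): four unique reference subcategories, two of them selected
names_ref = ["Buried", "Volume", "Buried", "Partial specific volume", "Accessible surface area (ASA)"]
selected_labels = ["Buried", "Volume"]

print(coverage_percent(selected_labels, names_ref))  # 50.0
```

With the objects from the example above, selected_labels would correspond to the "subcategory" column of df_cat_selected.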

Instead of using the ‘subcategory’ coverage, you can compute the coverage of ‘category’ or ‘scale_name’ by adjusting the col_name (default=‘subcategory’) parameter:

# Create list of categories to be covered
df_cat = aa.load_scales(name="scales_cat")
names_ref = df_cat[df_cat["scale_id"].isin(scale_ids)]["category"].tolist()

# Select scales using a category coverage of 100%
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref, col_name="category")
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]

aa.display_df(df_cat_selected, show_shape=True, n_rows=5)
From 100 scales, the following 24 scales were selected:
DataFrame shape: (24, 5)
     scale_id    category     subcategory     scale_name                     scale_description
28   ARGP820103  ASA/Volume   Buried          Buried                         Membrane-buried...s et al., 1982)
29   CHOC760103  ASA/Volume   Buried          Buried                         Proportion of r...(Chothia, 1976)
48   CHOC760101  ASA/Volume   Volume          Accessible surface area (ASA)  Residue accessi...(Chothia, 1976)
65   DAYM780101  Composition  AA composition  AA composition                 Amino acid comp... et al., 1978a)
120  BULH740101  Composition  MPs (anchor)    TFE to surface                 Transfer free e...l-Breese, 1974)
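The selection shrinks from 68 to 24 scales when coverage is computed on categories. One plausible reading, given that the initial number of clusters follows the number of unique names in names_ref, is that coarser labels need fewer clusters to be covered. The sketch below only counts unique labels per column for a few rows taken from the tables on this page. It does not call aaanalysis, and n_unique is a hypothetical helper.

```python
# Rows taken from the df_cat tables shown on this page (scale_description omitted)
rows = [
    {"scale_id": "LINS030110", "category": "ASA/Volume",
     "subcategory": "Accessible surface area (ASA)", "scale_name": "ASA (folded coil/turn)"},
    {"scale_id": "LINS030113", "category": "ASA/Volume",
     "subcategory": "Accessible surface area (ASA)", "scale_name": "ASA (folded coil/turn)"},
    {"scale_id": "JANJ780101", "category": "ASA/Volume",
     "subcategory": "Accessible surface area (ASA)", "scale_name": "ASA (folded protein)"},
    {"scale_id": "BIOV880101", "category": "ASA/Volume",
     "subcategory": "Buried", "scale_name": "Buriability"},
    {"scale_id": "DAYM780101", "category": "Composition",
     "subcategory": "AA composition", "scale_name": "AA composition"},
]


def n_unique(rows, col_name):
    """Number of unique labels in the given column."""
    return len({row[col_name] for row in rows})


counts = {col: n_unique(rows, col) for col in ("category", "subcategory", "scale_name")}
print(counts)  # {'category': 2, 'subcategory': 3, 'scale_name': 4}
```

The coarser the column, the fewer unique names have to be covered, which is consistent with the smaller selection reported for col_name="category".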