aaanalysis.SequenceFeature.feature_matrix

SequenceFeature.feature_matrix(features=None, df_parts=None, df_scales=None, accept_gaps=False, n_jobs=1)

Create feature matrix for given feature ids and sequence parts.

Parameters:

features (array-like, shape (n_features,)) – Ids of features for which matrix of feature values should be created.
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
accept_gaps (bool, default=False) – Whether to accept missing values (gaps) in the sequence parts. If True, gaps are omitted from the computations.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores.

Returns:

X – Feature matrix containing feature values for samples.

Return type:

array-like, shape (n_samples, n_features)

Notes

Use parallel processing only for a large number of features (more than ~1000 features per core).

Examples

To demonstrate the SequenceFeature().feature_matrix() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC")
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)

features and df_parts must be provided to retrieve the feature matrix:

X = sf.feature_matrix(features=features, df_parts=df_parts)
print(f"n samples: {len(df_parts)}")
print(f"n features: {len(features)}")
# X has a shape of n_samples, n_features
print(f"Shape of X: {X.shape}")

n samples: 126
n features: 150
Shape of X: (126, 150)

If sequences in df_parts contain gaps, you can enable accept_gaps so that feature values are computed as the average over the part-split combination, ignoring gaps:

X = sf.feature_matrix(features=features, df_parts=df_parts, accept_gaps=True)
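As a rough illustration of what averaging while ignoring gaps means, the following minimal sketch computes the mean scale value of a sequence segment and skips gap characters. The scale values, the gap symbol "-", and the helper name mean_scale_value are assumptions made for illustration only; this is not the library's implementation.

```python
import math

# Hypothetical scale values for three amino acids (illustrative only)
scale = {"A": 0.2, "L": 0.8, "K": 0.5}


def mean_scale_value(segment, scale, gap="-"):
    """Average the scale values of a segment, skipping gap characters."""
    values = [scale[aa] for aa in segment if aa != gap]
    # A segment made up only of gaps has no defined average
    if not values:
        return math.nan
    return sum(values) / len(values)


# The gap in the third position is skipped; only A, L, and K contribute
print(mean_scale_value("AL-K", scale))
```

In this sketch, a segment consisting only of gaps yields NaN rather than raising an error; how the library handles that case is not stated here.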

Multiprocessing is enabled through the n_jobs parameter; with n_jobs=None, the number of cores is chosen automatically. However, this is only recommended for more than ~1000 features per core due to potential process management overhead:

import time

# Run without multiprocessing
time_start = time.time()
X = sf.feature_matrix(features=features, df_parts=df_parts)
time_no_mp = round(time.time() - time_start, 2)
print(f"Time without multiprocessing: {time_no_mp} seconds")

# Run with multiprocessing
time_start = time.time()
X = sf.feature_matrix(features=features, df_parts=df_parts, n_jobs=None)
time_mp = round(time.time() - time_start, 2)
print(f"Time with multiprocessing: {time_mp} seconds")

Time without multiprocessing: 0.54 seconds
Time with multiprocessing: 9.33 seconds
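With only 150 features, the overhead of starting processes outweighs any gain, which matches the rule of thumb of roughly 1000 features per core. One way to apply that rule is sketched below. The helper choose_n_jobs and its min_features_per_core parameter are hypothetical names introduced for illustration and are not part of aaanalysis.

```python
import os


def choose_n_jobs(n_features, min_features_per_core=1000):
    """Return a core count that keeps at least min_features_per_core features per core."""
    n_cores = os.cpu_count() or 1  # os.cpu_count() may return None
    # Integer division gives the number of cores that can each receive a full share
    n_jobs = n_features // min_features_per_core
    # Clamp the result to the range [1, n_cores]
    return max(1, min(n_jobs, n_cores))


print(choose_n_jobs(150))  # → 1, since 150 features are below the threshold
```

The returned value could then be passed on, for example as n_jobs=choose_n_jobs(len(features)), so that small feature sets stay on a single core.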