SequenceFeature.feature_matrix

SequenceFeature.feature_matrix(features, df_parts=None, df_scales=None, accept_gaps=False, n_jobs=1, batch=False, df_seq=None, df_parts_kws=None)

Create feature matrix for given feature ids and sequence parts.

For each sample (row of df_parts) and each feature id, looks up the physicochemical scale values at the residue positions defined by the feature’s Part and Split components and averages them into a single feature value. The result is the numerical input X consumed by CPP.run() and by NumericalFeature.filter_correlation().

Added in version 0.1.0.

Changed in version 1.1.0: Added the batch parameter for building a list of df_parts in a single pass.

Changed in version 1.1.0: Added the df_seq and df_parts_kws parameters to build df_parts internally, so the sequence-to-matrix step no longer requires a separate get_df_parts() call.

Parameters:

features (array-like, shape (n_features,) or pd.DataFrame) – Ids of features ('PART-SPLIT-SCALE') for which matrix of feature values should be created. Alternatively, a df_feat DataFrame, in which case its 'feature' column is used.
df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. If batch=True, instead a list of such DataFrames (one per batch; all must share the same part columns). Provide exactly one of df_parts or df_seq.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
accept_gaps (bool, default=False) – Whether to accept missing values (gaps) in the sequence parts. If True, gaps are omitted when feature values are computed.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
batch (bool, default=False) – If True, df_parts is a list of part DataFrames processed in one amortized call (concatenated → Cython builder runs once → split back), returning one matrix per batch. Use for per-protein sliding scoring where the same features are applied to many small df_parts in a tight loop; the result is byte-identical to calling this per batch. Not supported together with df_seq.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with unique protein identifiers and sequence information in a distinct format: Position-based, Part-based, Sequence-based, or Sequence-TMD-based. If given, df_parts is built internally via get_df_parts(), as an alternative to passing df_parts directly. Provide exactly one of df_parts or df_seq.
df_parts_kws (dict, optional) – Keyword arguments forwarded to get_df_parts() when building df_parts from df_seq (e.g. {"list_parts": ["tmd"], "jmd_n_len": 10, "jmd_c_len": 10}). Keys must be get_df_parts() parameter names (df_seq excluded); unset options use their defaults. The JMD flank lengths jmd_n_len / jmd_c_len default to 10, while tmd_len defaults to None (the TMD length is variable, read from each sequence, except in the Position-based input mode where it is fixed). Only valid together with df_seq.
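
The combination rules stated for df_parts, df_seq, df_parts_kws, and batch can be summarized as a small check. The following is only a sketch of those rules as documented above; check_inputs is a hypothetical helper, not part of the library, and the error messages are illustrative.

```python
def check_inputs(df_parts=None, df_seq=None, df_parts_kws=None, batch=False):
    """Hypothetical helper mirroring the documented argument rules.

    Any non-None object stands in for a DataFrame here.
    """
    # Exactly one of df_parts or df_seq must be given
    if (df_parts is None) == (df_seq is None):
        raise ValueError("Provide exactly one of 'df_parts' or 'df_seq'.")
    # batch=True is not supported together with df_seq
    if batch and df_seq is not None:
        raise ValueError("'batch=True' is not supported together with 'df_seq'.")
    # df_parts_kws is only valid together with df_seq
    if df_parts_kws is not None and df_seq is None:
        raise ValueError("'df_parts_kws' is only valid together with 'df_seq'.")
    # With batch=True, df_parts must be a list of part DataFrames
    if batch and not isinstance(df_parts, list):
        raise ValueError("With 'batch=True', 'df_parts' must be a list.")


# Valid combinations pass silently
check_inputs(df_parts="parts")
check_inputs(df_seq="seq", df_parts_kws={"jmd_n_len": 10})
check_inputs(df_parts=["parts_1", "parts_2"], batch=True)
```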

Returns:

X – Feature matrix containing feature values for samples. If batch=True, a list of such matrices aligned to the input list of df_parts.

Return type:

array-like, shape (n_samples, n_features)

Notes

Use parallel processing only for a high number of features (more than ~1000 features per core).
batch=True amortizes the per-call scale-lookup build and kernel warm-up that dominate when this method is called thousands of times on small df_parts.
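
The rule of thumb in the first note can be turned into a simple heuristic for picking n_jobs. This is a minimal sketch under the assumption that about 1000 features per core is the break-even point; suggest_n_jobs is a hypothetical helper and does not reflect how the library optimizes n_jobs=None internally.

```python
import os


def suggest_n_jobs(n_features, n_cores=None, min_features_per_core=1000):
    """Hypothetical heuristic: use one core per ~1000 features, capped by n_cores."""
    if n_cores is None:
        # os.cpu_count() may return None on some platforms
        n_cores = os.cpu_count() or 1
    return max(1, min(n_cores, n_features // min_features_per_core))


print(suggest_n_jobs(150, n_cores=8))    # 1
print(suggest_n_jobs(5000, n_cores=8))   # 5
print(suggest_n_jobs(50000, n_cores=8))  # 8
```

With 150 features, as in the DOM_GSEC feature set, the heuristic stays at a single core.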

Examples

To demonstrate the SequenceFeature().feature_matrix() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC")
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)

features and df_parts must be provided to retrieve the feature matrix:

X = sf.feature_matrix(features=features, df_parts=df_parts)
print(f"n samples: {len(df_parts)}")
print(f"n features: {len(features)}")
# X has a shape of n_samples, n_features
print(f"Shape of X: {X.shape}")

n samples: 126
n features: 150
Shape of X: (126, 150)
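
Each entry of X is the average of scale values over the residues selected by one feature id. The sketch below illustrates this for a single segment-type split in plain Python. It rests on assumptions: the scale values are made up (not taken from load_scales()), and the segment rule (i-th of n roughly equal slices) is an illustrative stand-in rather than the library's exact splitting logic.

```python
# Toy scale with hypothetical values (not taken from load_scales())
toy_scale = {"A": 0.2, "L": 0.9, "G": 0.1, "K": 0.5}


def segment(part_seq, i_th, n_split):
    """Return the i-th (1-based) of n_split roughly equal slices (assumed rule)."""
    n = len(part_seq)
    start = (i_th - 1) * n // n_split
    stop = i_th * n // n_split
    return part_seq[start:stop]


def feature_value(part_seq, i_th, n_split, scale):
    """Average the scale values over the residues of the selected segment."""
    residues = segment(part_seq, i_th, n_split)
    values = [scale[res] for res in residues]
    return sum(values) / len(values)


tmd = "ALLGKA"  # hypothetical sequence part
# Second of three segments is "LG": (0.9 + 0.1) / 2
print(feature_value(tmd, i_th=2, n_split=3, scale=toy_scale))  # 0.5
```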

Instead of a list of feature ids, the df_feat DataFrame can be passed directly as features — its 'feature' column is used automatically, so features=df_feat is equivalent to features=list(df_feat["feature"]):

X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
print(f"Shape of X: {X.shape}")

Shape of X: (126, 150)

Instead of building df_parts first, you can pass the sequence DataFrame df_seq directly, optionally with df_parts_kws to forward get_df_parts() options such as list_parts or the JMD lengths. feature_matrix() then builds df_parts internally via SequenceFeature.get_df_parts(), collapsing the two-step pattern into one call. The result is identical to the explicit df_parts form; provide exactly one of df_parts or df_seq:

# One-step: build df_parts internally from df_seq
X_one_step = sf.feature_matrix(features=features, df_seq=df_seq)
# Equivalent two-step form
X_two_step = sf.feature_matrix(features=features, df_parts=sf.get_df_parts(df_seq=df_seq))
print(f"Shape of X: {X_one_step.shape}")
print(f"Identical to two-step result: {(X_one_step == X_two_step).all()}")
# Forward get_df_parts options via df_parts_kws (here the JMD lengths, set explicitly to their default of 10)
X_kws = sf.feature_matrix(features=features, df_seq=df_seq,
                          df_parts_kws={"jmd_n_len": 10, "jmd_c_len": 10})
print(f"df_parts_kws result shape: {X_kws.shape}")

Shape of X: (126, 150)
Identical to two-step result: True
df_parts_kws result shape: (126, 150)

If sequences in df_parts contain gaps, you can enable accept_gaps so that each feature value is computed as the average over the part-split combination, ignoring gaps:

X = sf.feature_matrix(features=features, df_parts=df_parts, accept_gaps=True)
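
The effect of ignoring gaps on the average can be sketched in plain Python. The example assumes "-" as the gap symbol and uses made-up scale values; raising an error when gaps are present and accept_gaps=False is an assumption chosen for illustration, not a statement about the library's behavior.

```python
GAP = "-"  # assumed gap symbol
toy_scale = {"A": 0.2, "L": 0.9, "G": 0.1}  # hypothetical scale values


def mean_ignoring_gaps(residues, scale, accept_gaps=False):
    """Average scale values over residues, skipping gaps if accept_gaps=True."""
    if not accept_gaps and GAP in residues:
        # Assumed behavior for this sketch only
        raise ValueError("Gaps found; set accept_gaps=True to ignore them.")
    values = [scale[res] for res in residues if res != GAP]
    if not values:
        # Nothing left to average (segment consists of gaps only)
        return float("nan")
    return sum(values) / len(values)


# The gap is dropped, so the mean is taken over "L" and "G" only
print(mean_ignoring_gaps("L-G", toy_scale, accept_gaps=True))  # 0.5
```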

Multiprocessing is enabled via the n_jobs parameter: the number of cores is optimized automatically if n_jobs=None and set to all available cores if n_jobs=-1. However, this is only recommended for more than ~1000 features per core due to process management overhead:

import time

# Run without multiprocessing
time_start = time.time()
X = sf.feature_matrix(features=features, df_parts=df_parts)
time_no_mp = round(time.time() - time_start, 2)
print(f"Time without multiprocessing: {time_no_mp} seconds")

# Run with multiprocessing
time_start = time.time()
X = sf.feature_matrix(features=features, df_parts=df_parts, n_jobs=None)
time_mp = round(time.time() - time_start, 2)
print(f"Time with multiprocessing: {time_mp} seconds")

Time without multiprocessing: 0.03 seconds
Time with multiprocessing: 0.03 seconds

For per-protein sliding scoring the same features are applied to many small df_parts in a tight loop. Passing batch=True with a list of part DataFrames concatenates them, runs the (Cython) builder once, and splits the result back — one matrix per batch, byte-identical to calling the method per batch but much faster:

# split df_parts into per-protein batches (here: one row each)
list_df_parts = [df_parts.iloc[i:i+1] for i in range(len(df_parts))]
list_X = sf.feature_matrix(features=features, df_parts=list_df_parts, batch=True)
print(f"n batches: {len(list_X)}")
print(f"shape of first matrix: {list_X[0].shape}")

n batches: 126
shape of first matrix: (1, 150)
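
The concatenate, run once, split back pattern behind batch=True can be sketched independently of the library. In this minimal example, score_rows is a hypothetical stand-in for the expensive builder and plain lists stand in for part DataFrames; it only illustrates why the batched result matches the per-batch result.

```python
from itertools import accumulate


def score_rows(rows):
    """Hypothetical stand-in for the expensive builder: one value per row."""
    return [len(row) for row in rows]


def score_batched(batches):
    """Concatenate all batches, run the builder once, and split the result back."""
    sizes = [len(batch) for batch in batches]
    flat = [row for batch in batches for row in batch]
    flat_scores = score_rows(flat)  # single call instead of one per batch
    bounds = [0, *accumulate(sizes)]  # cumulative offsets of each batch
    return [flat_scores[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


batches = [["AL", "GKA"], ["LLLL"], []]
per_batch = [score_rows(batch) for batch in batches]
print(score_batched(batches) == per_batch)  # True
```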

Further parameters. SequenceFeature.feature_matrix() also accepts df_scales, a DataFrame of scales with letters typically representing amino acids:

# Further parameters: pass the scales DataFrame explicitly (defaults to load_scales)
df_scales = aa.load_scales()
X = sf.feature_matrix(features=features, df_parts=df_parts, df_scales=df_scales)
print(f"Shape of X: {X.shape}")

Shape of X: (126, 150)