SequenceFeature.feature_matrix
- SequenceFeature.feature_matrix(features, df_parts, df_scales=None, accept_gaps=False, n_jobs=1, batch=False)
Create feature matrix for given feature ids and sequence parts.
For each sample (row of df_parts) and each feature id, looks up the physicochemical scale values at the residue positions defined by the feature’s Part and Split components and averages them into a single feature value. The result is the numerical input X consumed by CPP.run() and by NumericalFeature.filter_correlation().
Added in version 0.1.0.
Changed in version 1.1.0: Added the batch parameter for building feature matrices for a list of df_parts in a single pass.
- Parameters:
features (array-like, shape (n_features,) or pd.DataFrame) – Ids of features ('PART-SPLIT-SCALE') for which the matrix of feature values should be created. Alternatively, a df_feat DataFrame, in which case its 'feature' column is used.
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts. If batch=True, instead a list of such DataFrames (one per batch; all must share the same part columns).
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
accept_gaps (bool, default=False) – Whether to accept missing values (gaps) by omitting them from the computations (if True).
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
batch (bool, default=False) – If True, df_parts is a list of part DataFrames processed in one amortized call (concatenated → Cython builder runs once → split back), returning one matrix per batch. Use for per-protein sliding scoring where the same features are applied to many small df_parts in a tight loop; the result is byte-identical to calling this method once per batch.
- Returns:
X – Feature matrix containing feature values for samples. If batch=True, a list of such matrices aligned to the input list of df_parts.
- Return type:
array-like, shape (n_samples, n_features)
Notes
Use parallel processing only for a high number of features (more than ~1000 features per core).
batch=True amortizes the per-call scale-lookup build and kernel warm-up that dominate when this method is called thousands of times on small df_parts.
Examples
To demonstrate the SequenceFeature().feature_matrix() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25]):

    import aaanalysis as aa
    aa.options["verbose"] = False

    df_seq = aa.load_dataset(name="DOM_GSEC")
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC")
    features = df_feat["feature"].to_list()
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
features and df_parts must be provided to retrieve the feature matrix:

    X = sf.feature_matrix(features=features, df_parts=df_parts)
    print(f"n samples: {len(df_parts)}")
    print(f"n features: {len(features)}")
    # X has a shape of (n_samples, n_features)
    print(f"Shape of X: {X.shape}")

    n samples: 126
    n features: 150
    Shape of X: (126, 150)
Instead of a list of feature ids, the df_feat DataFrame can be passed directly as features. Its 'feature' column is used automatically, so features=df_feat is equivalent to features=list(df_feat["feature"]):

    X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
    print(f"Shape of X: {X.shape}")

    Shape of X: (126, 150)
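Feature ids follow the 'PART-SPLIT-SCALE' pattern described under Parameters. As a minimal sketch of that naming scheme, the three components can be separated with plain string handling. The helper split_feature_id and the example id below are hypothetical and only illustrate the pattern; they are not part of aaanalysis and the id is not taken from the DOM_GSEC feature set:

```python
def split_feature_id(feature_id):
    """Return (part, split, scale) for an id of the form 'PART-SPLIT-SCALE'."""
    components = feature_id.split("-")
    if len(components) != 3:
        raise ValueError(f"Expected 'PART-SPLIT-SCALE', got: {feature_id!r}")
    part, split, scale = components
    return part, split, scale


# Hypothetical id, used only to show the three components
example_id = "TMD-Segment(1,2)-SCALE_A"
part, split, scale = split_feature_id(example_id)
print(f"part={part}, split={split}, scale={scale}")
```

This sketch assumes that none of the three components contains a hyphen itself.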
If sequences in df_parts contain gaps, you can enable accept_gaps so that the feature values are computed as the average of the part-split combination, ignoring gaps:

    X = sf.feature_matrix(features=features, df_parts=df_parts, accept_gaps=True)
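What "average ignoring gaps" means can be sketched in plain Python. This is a conceptual illustration and not the aaanalysis implementation: toy_scale holds made-up values, mean_scale_value is a hypothetical helper, and both the gap symbol "-" and the choice to raise an error when accept_gaps=False are assumptions of this sketch:

```python
# Hypothetical scale: maps residue letters to made-up values
toy_scale = {"A": 0.25, "L": 1.0, "K": 0.0, "G": 0.5}


def mean_scale_value(segment, scale, accept_gaps=False, gap="-"):
    """Average scale values over a segment, optionally skipping gap characters."""
    if not accept_gaps and gap in segment:
        raise ValueError("Segment contains gaps; set accept_gaps=True to skip them")
    # Gap positions contribute neither to the sum nor to the count
    values = [scale[letter] for letter in segment if letter != gap]
    if not values:
        return float("nan")
    return sum(values) / len(values)


print(mean_scale_value("ALKG", toy_scale))
print(mean_scale_value("AL-G", toy_scale, accept_gaps=True))
```

In the second call the gap is dropped before averaging, so the mean is taken over three residues instead of four.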
Multiprocessing can be enabled with the n_jobs parameter. With n_jobs=None the number of cores is chosen automatically; n_jobs=-1 uses all available cores. However, this is only recommended for more than ~1000 features per core due to potential process management overhead.

    import time

    # Run without multiprocessing
    time_start = time.time()
    X = sf.feature_matrix(features=features, df_parts=df_parts)
    time_no_mp = round(time.time() - time_start, 2)
    print(f"Time without multiprocessing: {time_no_mp} seconds")

    # Run with multiprocessing
    time_start = time.time()
    X = sf.feature_matrix(features=features, df_parts=df_parts, n_jobs=None)
    time_mp = round(time.time() - time_start, 2)
    print(f"Time with multiprocessing: {time_mp} seconds")

    Time without multiprocessing: 0.02 seconds
    Time with multiprocessing: 0.02 seconds
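The rule of thumb of roughly 1000 features per core can be turned into a small helper for picking n_jobs explicitly. The function suggest_n_jobs below is hypothetical and not part of aaanalysis; it is a sketch of one possible heuristic, not the logic the library applies for n_jobs=None:

```python
import os


def suggest_n_jobs(n_features, min_features_per_core=1000, max_cores=None):
    """Pick n_jobs so that each core handles at least min_features_per_core features."""
    if max_cores is None:
        # os.cpu_count() may return None on some platforms
        max_cores = os.cpu_count() or 1
    return max(1, min(max_cores, n_features // min_features_per_core))


# With the 150 DOM_GSEC features a single process is suggested
print(suggest_n_jobs(150))
```

The returned value could then be passed as n_jobs to feature_matrix(), which keeps small feature sets on a single core.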
For per-protein sliding scoring the same features are applied to many small df_parts in a tight loop. Passing batch=True with a list of part DataFrames concatenates them, runs the (Cython) builder once, and splits the result back — one matrix per batch, byte-identical to calling the method per batch but much faster:
    # Split df_parts into per-protein batches (here: one row each)
    list_df_parts = [df_parts.iloc[i:i+1] for i in range(len(df_parts))]
    list_X = sf.feature_matrix(features=features, df_parts=list_df_parts, batch=True)
    print(f"n batches: {len(list_X)}")
    print(f"shape of first matrix: {list_X[0].shape}")

    n batches: 126
    shape of first matrix: (1, 150)
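The concatenate, build once, split back pattern behind batch=True can be sketched independently of aaanalysis. In the following plain-Python illustration, build_rows is a hypothetical stand-in for an expensive per-call builder and build_batched is a hypothetical wrapper; neither reflects the actual Cython code:

```python
from itertools import accumulate


def build_rows(rows):
    """Stand-in builder: returns one output row per input sequence."""
    return [[len(seq), seq.count("L")] for seq in rows]


def build_batched(list_of_batches):
    """Concatenate all batches, call the builder once, and split the result back."""
    sizes = [len(batch) for batch in list_of_batches]
    flat = [row for batch in list_of_batches for row in batch]
    flat_result = build_rows(flat)  # single call for all batches
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return [flat_result[start:end] for start, end in zip(starts, ends)]


batches = [["ALL", "KG"], ["LLLL"], ["A", "L", "GK"]]
per_batch = [build_rows(batch) for batch in batches]
# The batched result matches calling the builder once per batch
assert build_batched(batches) == per_batch
print(f"n batches: {len(build_batched(batches))}")
```

Because each output row depends only on its own input row, splitting by the original batch sizes recovers exactly the per-batch results while the builder's fixed cost is paid once.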
Further parameters. SequenceFeature.feature_matrix also accepts:
df_scales – DataFrame of scales with letters typically representing amino acids.