aaanalysis.NumericalFeature.extend_alphabet
- static NumericalFeature.extend_alphabet(df_scales=None, new_letter=None, value_type='mean')
Extend amino acid alphabet of df_scales by new letter.
This function adds a new row to the DataFrame, representing the new amino acid letter. For each scale (column), it computes a specific statistic (min, mean, median, max) based on the values of existing amino acids (rows) and assigns this computed value to the new amino acid.
- Parameters:
df_scales (pd.DataFrame, shape (n_letters, n_scales)) – DataFrame of scales with letters typically representing amino acids.
new_letter (str) – The new letter to be added to the alphabet.
value_type ({'min', 'mean', 'median', 'max'}, default='mean') – The type of statistic to compute for the new letter.
- Returns:
df_scales – DataFrame with the extended alphabet including the new amino acid letter.
- Return type:
pd.DataFrame, shape (n_letters + 1, n_scales)
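The behavior described above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation: the function name `extend_alphabet_sketch`, the toy scale names, and the error handling (including the duplicate-letter check) are assumptions made for illustration.

```python
import pandas as pd

def extend_alphabet_sketch(df_scales, new_letter, value_type="mean"):
    """Hypothetical stand-in: append one row holding a per-column statistic."""
    allowed = {"min", "mean", "median", "max"}
    if value_type not in allowed:
        raise ValueError(f"value_type must be one of {sorted(allowed)}")
    # Assumption of this sketch: refuse to overwrite an existing letter
    if new_letter in df_scales.index:
        raise ValueError(f"'{new_letter}' is already in the alphabet")
    # One statistic per scale (column), computed over all existing letters (rows)
    new_row = df_scales.agg(value_type).rename(new_letter)
    extended = pd.concat([df_scales, new_row.to_frame().T])
    extended.index.name = df_scales.index.name
    return extended

# Toy scales (made-up values, not taken from load_scales)
toy = pd.DataFrame(
    {"scale_1": [0.0, 0.4, 1.0], "scale_2": [0.2, 0.0, 1.0]},
    index=pd.Index(["A", "C", "D"], name="AA"),
)
toy_x = extend_alphabet_sketch(toy, "X", value_type="median")
```

The input DataFrame is left unchanged; the result has one additional row, matching the documented return shape (n_letters + 1, n_scales).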
Examples
To handle protein sequences that contain amino acids not encoded in your scales, the NumericalFeature().extend_alphabet() method expands the existing alphabet. It adds a new letter whose values are a summary statistic, such as the minimum or mean, calculated from the current amino acids. This helps prevent missing values during feature engineering. To demonstrate this, we load our default scale DataFrame using load_scales:

    import aaanalysis as aa
    df_scales = aa.load_scales()
    aa.display_df(df_scales, n_cols=3, show_shape=True)
    DataFrame shape: (20, 586)

    AA  ANDN920101  ARGP820101  ARGP820102
    A   0.494000    0.230000    0.355000
    C   0.864000    0.404000    0.579000
    D   1.000000    0.174000    0.000000
    E   0.420000    0.177000    0.019000
    F   0.877000    0.762000    0.601000
    G   0.025000    0.026000    0.138000
    H   0.840000    0.230000    0.082000
    I   0.000000    0.838000    0.440000
    K   0.506000    0.434000    0.003000
    L   0.272000    0.577000    1.000000
    M   0.704000    0.445000    0.824000
    N   0.988000    0.023000    0.057000
    P   0.605000    0.736000    0.223000
    Q   0.519000    0.000000    0.211000
    R   0.531000    0.226000    0.047000
    S   0.679000    0.019000    0.289000
    T   0.494000    0.019000    0.248000
    V   0.000000    0.498000    0.324000
    W   0.926000    1.000000    0.226000
    Y   0.802000    0.709000    0.107000

Using the utility NumericalFeature class, you can add a new letter (new_letter) to the df_scales DataFrame and select a value_type (default='mean'):

    nf = aa.NumericalFeature()
    # Add new letter in last row of DataFrame
    df_scales_x_mean = nf.extend_alphabet(df_scales=df_scales, new_letter="X")
    aa.display_df(df_scales_x_mean, n_cols=3, show_shape=True, row_to_show="X")
    DataFrame shape: (21, 586)

    AA  ANDN920101  ARGP820101  ARGP820102
    X   0.577300    0.376350    0.288650

    # This should set each value of X to 0 since scales are min-max normalized
    df_scales_x_min = nf.extend_alphabet(df_scales=df_scales, new_letter="X", value_type="min")
    aa.display_df(df_scales_x_min, n_cols=3, row_to_show="X")
    AA  ANDN920101  ARGP820101  ARGP820102
    X   0.000000    0.000000    0.000000

This modified df_scales DataFrame can now be set as the global default using options:

    aa.options["df_scales"] = df_scales_x_mean
    # This will set internal default df_scales (but not affect load_scales)
    cpp_plot = aa.CPPPlot()
    df_scales_default = cpp_plot._df_scales
    aa.display_df(df_scales_default, n_cols=3, row_to_show="X")
    AA  ANDN920101  ARGP820101  ARGP820102
    X   0.577300    0.376350    0.288650
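To illustrate why an extended alphabet helps prevent missing values, the sketch below looks up scale values for a sequence containing the letter "X", before and after adding an "X" row. It uses pandas only; the toy scale, the sequence, and all variable names are assumptions for illustration and do not reflect how aaanalysis encodes sequences internally.

```python
import pandas as pd

# Toy scale with three letters (made-up values)
toy_scales = pd.DataFrame(
    {"scale_1": [0.0, 0.4, 1.0]},
    index=pd.Index(["A", "C", "D"], name="AA"),
)
seq = "ACXD"  # hypothetical sequence containing an unencoded letter

# Look up one row per residue; letters missing from the index yield NaN
encoded = toy_scales.reindex(list(seq))
n_missing_before = int(encoded["scale_1"].isna().sum())

# Add an "X" row holding the column-wise mean, then repeat the lookup
x_row = toy_scales.mean().rename("X").to_frame().T
extended = pd.concat([toy_scales, x_row])
encoded_ext = extended.reindex(list(seq))
n_missing_after = int(encoded_ext["scale_1"].isna().sum())

print(n_missing_before, n_missing_after)
```

In this toy case the lookup for "X" is undefined before the extension and resolves to the column mean afterwards.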