NumericalFeature.extend_alphabet

static NumericalFeature.extend_alphabet(df_scales, new_letter, value_type='mean')

Extend amino acid alphabet of df_scales by new letter.

This function adds a new row to the DataFrame, representing the new amino acid letter. For each scale (column), it computes the statistic selected by value_type (min, mean, median, or max) over the values of the existing amino acids (rows) and assigns the result to the new letter.

Added in version 0.1.0.

Parameters:

df_scales (pd.DataFrame, shape (n_letters, n_scales)) – DataFrame of scales with letters typically representing amino acids.
new_letter (str) – The new letter to be added to the alphabet.
value_type ({'min', 'mean', 'median', 'max'}, default='mean') – The type of statistic to compute for the new letter.

Returns:

df_scales – DataFrame with the extended alphabet including the new amino acid letter.

Return type:

pd.DataFrame, shape (n_letters + 1, n_scales)

Notes

If new_letter is already present in the index of df_scales, a ValueError is raised. Use this method only to add letters that do not yet appear in the alphabet.
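
The behavior described above can be approximated in plain pandas. The following is a minimal sketch, not the aaanalysis implementation: the function name extend_alphabet_sketch and the toy DataFrame are hypothetical and exist only for illustration.

```python
import pandas as pd


def extend_alphabet_sketch(df_scales, new_letter, value_type="mean"):
    """Hypothetical stand-in: append one row holding a per-column statistic."""
    if value_type not in ("min", "mean", "median", "max"):
        raise ValueError(f"Unsupported value_type: {value_type!r}")
    if new_letter in df_scales.index:
        raise ValueError(f"{new_letter!r} is already in the alphabet")
    # agg with a string name computes that statistic for each column (scale)
    new_row = df_scales.agg(value_type).rename(new_letter)
    # Append the statistic as the last row; the input DataFrame is not modified
    df_out = pd.concat([df_scales, new_row.to_frame().T])
    df_out.index.name = df_scales.index.name
    return df_out


# Toy min-max normalized scales (made-up values, not taken from load_scales)
df_toy = pd.DataFrame(
    {"scale_1": [0.0, 0.5, 1.0], "scale_2": [1.0, 0.0, 0.25]},
    index=pd.Index(["A", "C", "D"], name="AA"),
)
df_toy_x = extend_alphabet_sketch(df_toy, new_letter="X", value_type="min")
print(df_toy_x.loc["X"])
```

Because every toy column spans 0 to 1, value_type="min" gives 0 for each scale of the new letter, and value_type="max" would give 1.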

Examples

Protein sequences can contain letters that are not encoded in your scales, which leaves gaps when scale values are assigned. The NumericalFeature().extend_alphabet() method addresses this by expanding the existing alphabet: it adds a new letter whose value for each scale is a statistic, such as the minimum or mean, computed over the current amino acids. This helps prevent missing values during feature engineering. To demonstrate this, we load the default scale DataFrame using load_scales:

import aaanalysis as aa
df_scales = aa.load_scales()
aa.display_df(df_scales, n_cols=3, show_shape=True)

DataFrame shape: (20, 586)

	ANDN920101	ARGP820101	ARGP820102
AA
A	0.494000	0.230000	0.355000
C	0.864000	0.404000	0.579000
D	1.000000	0.174000	0.000000
E	0.420000	0.177000	0.019000
F	0.877000	0.762000	0.601000
G	0.025000	0.026000	0.138000
H	0.840000	0.230000	0.082000
I	0.000000	0.838000	0.440000
K	0.506000	0.434000	0.003000
L	0.272000	0.577000	1.000000
M	0.704000	0.445000	0.824000
N	0.988000	0.023000	0.057000
P	0.605000	0.736000	0.223000
Q	0.519000	0.000000	0.211000
R	0.531000	0.226000	0.047000
S	0.679000	0.019000	0.289000
T	0.494000	0.019000	0.248000
V	0.000000	0.498000	0.324000
W	0.926000	1.000000	0.226000
Y	0.802000	0.709000	0.107000

Using the NumericalFeature utility class, you can add a new letter (new_letter) to the df_scales DataFrame and select a value_type (default='mean').

nf = aa.NumericalFeature()
# Add new letter in last row of DataFrame
df_scales_x_mean = nf.extend_alphabet(df_scales=df_scales, new_letter="X")
aa.display_df(df_scales_x_mean, n_cols=3, show_shape=True, row_to_show="X")

DataFrame shape: (21, 586)

	ANDN920101	ARGP820101	ARGP820102
AA
X	0.577300	0.376350	0.288650

# This should set each value of X to 0 since scales are min-max normalized
df_scales_x_min = nf.extend_alphabet(df_scales=df_scales, new_letter="X", value_type="min")
aa.display_df(df_scales_x_min, n_cols=3, row_to_show="X")

	ANDN920101	ARGP820101	ARGP820102
AA
X	0.000000	0.000000	0.000000

This modified df_scales DataFrame can now be set as the global default using options:

aa.options["df_scales"] = df_scales_x_mean
# This will set internal default df_scales (but not affect load_scales)
cpp_plot = aa.CPPPlot()
df_scales_default = cpp_plot._df_scales
aa.display_df(df_scales_default, n_cols=3, row_to_show="X")

	ANDN920101	ARGP820101	ARGP820102
AA
X	0.577300	0.376350	0.288650
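
To illustrate the missing values mentioned at the start of these examples, the sketch below looks up scale values for a sequence containing a letter that is not in the scale index. It uses a plain pandas lookup on made-up toy data and does not reflect how aaanalysis assigns scale values internally; the variable names are hypothetical.

```python
import pandas as pd

# Toy scale table (made-up values); "X" is not part of its alphabet
df_toy = pd.DataFrame(
    {"scale_1": [0.0, 0.5, 1.0]},
    index=pd.Index(["A", "C", "D"], name="AA"),
)
seq = "ACXD"

# Look up one scale value per residue; letters missing from the index give NaN
values_before = df_toy["scale_1"].reindex(list(seq))
n_missing_before = int(values_before.isna().sum())

# Add "X" as the column-wise mean (hand-rolled here, mirroring value_type='mean')
df_toy_x = df_toy.copy()
df_toy_x.loc["X"] = df_toy.mean()
values_after = df_toy_x["scale_1"].reindex(list(seq))
n_missing_after = int(values_after.isna().sum())

print(n_missing_before, n_missing_after)
```

Before the extension, the lookup for "X" yields a missing value; afterwards every residue in the toy sequence maps to a number.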