ShapModel.add_sample_mean_dif

static ShapModel.add_sample_mean_dif(X, labels, label_ref=0, *, df_feat, drop=False, samples=None, names=None, group_average=False, df_seq=None, sample_positions=None)

Compute the feature value difference between selected samples and a reference group average.

Three different scenarios for computing the difference with the reference group average (MEAN_REF) are possible:

Single sample: Computes the difference for a selected sample and MEAN_REF.

Multiple samples: Computes differences for multiple selected samples (all by default) individually against MEAN_REF.

Group of samples: Computes the difference between the average of a selected group of samples and MEAN_REF.

The calculated differences are added to df_feat as new columns named mean_dif_'name(s)', corresponding to each sample or group.
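The three scenarios can be sketched in plain Python. This is a minimal illustration on toy data, not the library's implementation: the helper name mean_dif_sketch and the toy matrix are assumptions made for this example, and the result is returned as one list per sample (or group) rather than as new df_feat columns.

```python
def mean_dif_sketch(X, labels, label_ref=0, samples=None, group_average=False):
    """Toy re-implementation (hypothetical helper, not part of aaanalysis)."""
    n_features = len(X[0])
    # MEAN_REF: per-feature average over all rows carrying the reference label
    ref_rows = [row for row, label in zip(X, labels) if label == label_ref]
    mean_ref = [sum(row[j] for row in ref_rows) / len(ref_rows) for j in range(n_features)]
    # Normalize the selection: None means all samples, an int means one sample
    if samples is None:
        samples = list(range(len(X)))
    elif isinstance(samples, int):
        samples = [samples]
    selected = [X[i] for i in samples]
    # Group scenario: collapse the selection into a single averaged row first
    if group_average:
        selected = [[sum(row[j] for row in selected) / len(selected) for j in range(n_features)]]
    # One difference vector (sample or group minus MEAN_REF) per selected row
    return [[row[j] - mean_ref[j] for j in range(n_features)] for row in selected]


# Toy data: 4 samples x 2 features; samples 0 and 1 form the reference group
X = [[0.25, 0.5], [0.75, 1.0], [1.0, 0.25], [0.5, 0.75]]
labels = [0, 0, 1, 1]

single = mean_dif_sketch(X, labels, samples=2)                            # single sample
multiple = mean_dif_sketch(X, labels)                                     # all samples individually
group = mean_dif_sketch(X, labels, samples=[2, 3], group_average=True)    # group average
```

In the actual method, each of these difference vectors becomes one mean_dif_'name' column of df_feat, with one row per feature.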

Note

This per-sample column (sample minus the label_ref group average) is what a sample-level CPP-SHAP map or ranking should be colored by: pass it as col_val to CPPPlot.feature_map(), CPPPlot.ranking(), or CPPPlot.profile(). It is not the group-level mean_dif from CPP.run() (test group minus reference group), which is identical for every sample. Choose label_ref so the reference group matches the contrast you want each sample explained against (e.g. an others / unlabeled group rather than a curated negative set).

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples,)) – Class labels for samples in X (typically, 1=positive, 0=negative).
label_ref (int, default=0) – Class label of the reference group in labels.
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
drop (bool, default=False) – If True, drop already existing sample-specific mean difference columns from df_feat before inserting the new ones.
samples (int, list of int, str, list of str, or None) – Sample(s) to compute the difference for, given either as row position(s) in X or as entry name(s) (str) from the entry column of df_seq (resolved to the matching row(s)). If None, the difference is computed for each sample.
names (str or list of str, optional) –
Unique name(s) used for the feature value differences columns. When provided, they should align with samples as follows:
- Single sample: name as string and samples as integer or entry name.
- Multiple samples: name as list of string and samples as corresponding list.
- Group: name as string and samples as list, each indicating a group sample.
If samples is None (all samples are considered), names must be a list with one name per sample. When samples is given as entry name(s) and names is None, the entry name(s) are used.
group_average (bool, default=False) – If True, compute the average of samples given by samples.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to X. Required only when samples is given as entry name(s).
sample_positions (int, list of int, str, list of str, or None) – Deprecated alias for samples (removed in 1.2.0).

Returns:

df_feat – Feature DataFrame including the feature value difference column(s). If the difference is computed for multiple samples individually, n is the number of samples; otherwise (single sample or group average), n=1.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+n)

Examples

To demonstrate the ShapModel().add_sample_mean_dif() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

	entry	gene	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	FXYD3	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MTDH	MAARSWQDELAQQAE...SPKQIKKKKKARRET	0	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	PMEPA1	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	0	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P05067	APP	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
5	P14925	Pam	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
6	P70180	Npr3	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	477	499	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER

You need to provide X, labels, and df_feat to the ShapModel().add_sample_mean_dif() method, which then computes the feature value difference for each sample against the reference group average:

sm = aa.ShapModel()

# Compute difference against average for negative (0) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
aa.display_df(df_feat)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_fdr_bh	positions	feat_importance	feat_importance_std	mean_dif_Protein0	mean_dif_Protein1	mean_dif_Protein2	mean_dif_Protein3	mean_dif_Protein4	mean_dif_Protein5
1	TMD_C_JMD_C-Seg...3,4)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.244000	0.103666	0.103666	0.106692	0.110506	0.000000	31,32,33,34,35	0.970400	1.438918	0.100000	-0.100000	0.000000	0.200000	0.200000	0.200000
2	TMD_C_JMD_C-Seg...3,4)-FINA910104	Conformation	α-helix (C-cap)	α-helix termination	Helix terminati...n et al., 1991)	0.243000	0.085064	0.085064	0.098774	0.096946	0.000000	31,32,33,34,35	0.000000	0.000000	0.087600	-0.087600	0.000000	0.175200	0.175200	0.175200
3	TMD_C_JMD_C-Seg...6,9)-LEVM760105	Shape	Side chain length	Side chain length	Radius of gyrat... (Levitt, 1976)	0.233000	0.137044	0.137044	0.161683	0.176964	0.000001	32,33	1.554800	2.109848	0.123890	-0.360780	0.236890	0.282890	0.362890	0.338560
4	TMD_C_JMD_C-Seg...3,4)-HUTJ700102	Energy	Entropy	Entropy	Absolute entrop...Hutchens, 1970)	0.229000	0.098224	0.098224	0.106865	0.124608	0.000001	31,32,33,34,35	3.111200	3.109955	0.131267	-0.269733	0.138467	0.231467	0.312467	0.277867
5	TMD_C_JMD_C-Seg...6,9)-RADA880106	ASA/Volume	Volume	Accessible surface area (ASA)	Accessible surf...olfenden, 1988)	0.223000	0.095071	0.095071	0.114758	0.132829	0.000002	32,33	0.000000	0.000000	0.067557	-0.230443	0.162887	0.165557	0.278227	0.208887

To change the reference group, use the label_ref parameter (default=0). Since df_feat already contains sample-specific mean difference columns (mean_dif_Protein0 to mean_dif_Protein5), we must set drop=True to remove them:

# Compute difference against average for positive (1) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, label_ref=1, df_feat=df_feat, drop=True)
aa.display_df(df_feat, n_cols=-6)

	mean_dif_Protein0	mean_dif_Protein1	mean_dif_Protein2	mean_dif_Protein3	mean_dif_Protein4	mean_dif_Protein5
1	-0.100000	-0.300000	-0.200000	-0.000000	-0.000000	-0.000000
2	-0.087600	-0.262800	-0.175200	-0.000000	-0.000000	-0.000000
3	-0.204223	-0.688893	-0.091223	-0.045223	0.034777	0.010447
4	-0.142667	-0.543667	-0.135467	-0.042467	0.038533	0.003933
5	-0.150000	-0.448000	-0.054670	-0.052000	0.060670	-0.008670

Select a specific sample by its row position in X using the samples parameter. You can name its column with the names parameter:

# Single sample
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)

	mean_dif_Selected_sample
1	0.100000
2	0.087600
3	0.123890
4	0.131267
5	0.067557

Samples can also be selected by accession instead of position. Provide df_seq and pass entry name(s) to samples; they are resolved to the matching row(s), and the difference column is named after the accession unless names is given:

# Select a sample by its entry/accession via df_seq, labelling the mean-difference column by
# the readable gene name from load_dataset's bundled 'gene' column.
row = df_seq.iloc[0]
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True,
                                 df_seq=df_seq, samples=row["entry"], names=row["gene"])
aa.display_df(df_feat, n_cols=-1)

	mean_dif_FXYD3
1	0.100000
2	0.087600
3	0.123890
4	0.131267
5	0.067557
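Conceptually, selecting by accession amounts to looking up each entry name in the entry column and using the position of the matching row in X. The following is a rough sketch of that lookup in plain Python; resolve_samples is a hypothetical helper written for this illustration (not the library's internal code), and the entry list is copied from the df_seq table above.

```python
def resolve_samples(samples, entries):
    """Map entry names (str) or row positions (int) to row positions.

    Hypothetical helper for illustration; `entries` plays the role of
    df_seq["entry"] and is assumed to be row-aligned to X.
    """
    if isinstance(samples, (int, str)):
        samples = [samples]
    positions = []
    for sample in samples:
        if isinstance(sample, str):
            if sample not in entries:
                raise ValueError(f"Entry '{sample}' not found in entries")
            positions.append(entries.index(sample))
        else:
            positions.append(sample)
    return positions


entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
print(resolve_samples("Q14802", entries))              # row position of FXYD3
print(resolve_samples(["P05067", "P14925"], entries))  # two positive samples
```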

Three different scenarios are possible:

Single sample: Compute the difference for a single sample (above).
Multiple samples: Compute the difference for multiple samples (all by default).
Group of samples: Compute the difference using the average of a group of samples.

To target specific samples, define their row positions (or entry names) in samples. If the names parameter is used, it must contain one name per selected sample:

# Multiple samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)

	mean_dif_Sample 1	mean_dif_Sample 2
1	0.100000	-0.100000
2	0.087600	-0.087600
3	0.123890	-0.360780
4	0.131267	-0.269733
5	0.067557	-0.230443
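The alignment rule between samples and names can be made explicit with a small check. This sketch is an assumption-laden illustration rather than the library's validation code: build_column_names is a hypothetical helper, the Protein<i> default mirrors the column names seen in the all-samples output above (how the library names an unnamed subset is not stated here, so that part is a guess), and the 'Group' default is taken from the group example in this section.

```python
def build_column_names(samples, names=None, group_average=False):
    """Return mean_dif_<name> column names (hypothetical helper)."""
    if isinstance(samples, int):
        samples = [samples]
    if group_average:
        # A group yields exactly one column, so a single string name is expected
        if names is None:
            names = "Group"
        if not isinstance(names, str):
            raise ValueError("For a group, 'names' should be a single string")
        return [f"mean_dif_{names}"]
    if names is None:
        names = [f"Protein{i}" for i in samples]  # assumed default naming
    elif isinstance(names, str):
        names = [names]
    if len(names) != len(samples):
        raise ValueError("'names' and 'samples' should have the same length")
    if len(set(names)) != len(names):
        raise ValueError("'names' should be unique")
    return [f"mean_dif_{name}" for name in names]


print(build_column_names([0, 1], names=["Sample 1", "Sample 2"]))
print(build_column_names([0, 1], group_average=True))
```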

To compute the group average, set group_average=True and specify the sample indices in samples. Assign a name to the group using the names parameter; if not provided, 'Group' is used as the default name:

# Group of samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-1)

	mean_dif_Group
1	0.000000
2	0.000000
3	-0.118445
4	-0.069233
5	-0.081443
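Because subtracting MEAN_REF is a linear operation, the group difference equals the average of the individual sample differences. The short check below illustrates this with the rounded values copied from the mean_dif_Sample 1 and mean_dif_Sample 2 columns shown earlier; it operates on displayed numbers only, so it is a consistency check rather than a recomputation from X.

```python
# Per-feature differences of samples 0 and 1 against the label 0 reference,
# copied from the "Multiple samples" output (rounded display values)
dif_sample0 = [0.100000, 0.087600, 0.123890, 0.131267, 0.067557]
dif_sample1 = [-0.100000, -0.087600, -0.360780, -0.269733, -0.230443]

# Averaging the two difference vectors gives the group difference
dif_group = [round((a + b) / 2, 6) for a, b in zip(dif_sample0, dif_sample1)]
print(dif_group)
```

The result agrees with the mean_dif_Group column above.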