ShapModel.add_sample_mean_dif

static ShapModel.add_sample_mean_dif(X, labels, label_ref=0, *, df_feat, drop=False, samples=None, names=None, group_average=False, df_seq=None, sample_positions=None)

Compute the feature value difference between selected samples and a reference group average.

Three different scenarios for computing the difference with the reference group average (MEAN_REF) are possible:

  1. Single sample: Computes the difference for a selected sample and MEAN_REF.

  2. Multiple samples: Computes differences for multiple selected samples (all by default) individually against MEAN_REF.

  3. Group of samples: Computes the difference between the average of a selected group of samples and MEAN_REF.

The calculated differences are added to df_feat as new columns named mean_dif_'name(s)', corresponding to each sample or group.
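The three scenarios reduce to the same arithmetic, which the following minimal sketch illustrates in plain Python. It is not the library's implementation: the helper names (column_means, mean_ref, sample_dif, group_dif) and the toy data are hypothetical and serve only to show what "difference against MEAN_REF" means.

```python
from statistics import fmean

def column_means(rows):
    """Average each feature (column) over the given rows."""
    return [fmean(col) for col in zip(*rows)]

def mean_ref(X, labels, label_ref=0):
    """MEAN_REF: per-feature mean over all rows whose label equals label_ref."""
    ref_rows = [row for row, label in zip(X, labels) if label == label_ref]
    return column_means(ref_rows)

def sample_dif(X, labels, sample, label_ref=0):
    """Scenarios 1 and 2: one sample's feature values minus MEAN_REF."""
    ref = mean_ref(X, labels, label_ref)
    return [x - m for x, m in zip(X[sample], ref)]

def group_dif(X, labels, samples, label_ref=0):
    """Scenario 3: average of the selected samples minus MEAN_REF."""
    ref = mean_ref(X, labels, label_ref)
    group_mean = column_means([X[i] for i in samples])
    return [g - m for g, m in zip(group_mean, ref)]

# Toy data (hypothetical): 4 samples, 2 features
X = [[1.0, 4.0], [3.0, 8.0], [5.0, 2.0], [7.0, 6.0]]
labels = [0, 0, 1, 1]

single = sample_dif(X, labels, sample=2)                         # one sample
multiple = {i: sample_dif(X, labels, i) for i in range(len(X))}  # each sample individually
group = group_dif(X, labels, samples=[2, 3])                     # group average
```

In this sketch, each entry of multiple would correspond to one new mean_dif column, while single and group each correspond to exactly one column.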

Added in version 0.1.0.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).

  • label_ref (int, default=0) – Class label of reference group in labels.

  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • drop (bool, default=False) – If True, allow dropping of already existing sample specific mean difference columns from df_feat before inserting.

  • samples (int, list of int, str, list of str, or None) – Sample(s) to compute the difference for, given either as row position(s) in X or as entry name(s) (str) from the entry column of df_seq (resolved to the matching row(s)). If None, the difference for each sample will be returned.

  • names (str or list of str, optional) –

    Unique name(s) used for the feature value differences columns. When provided, they should align with samples as follows:

    • Single sample: name as string and samples as integer or entry name.

    • Multiple samples: name as list of string and samples as corresponding list.

    • Group: name as string and samples as list, each indicating a group sample.

    If samples is None (all samples are considered), names must be a list with one name per sample. When samples is given as entry name(s) and names is None, the entry name(s) are used.

  • group_average (bool, default=False) – If True, compute the average of samples given by samples.

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to X. Required only when samples is given as entry name(s).

  • sample_positions (int, list of int, str, list of str, or None) – Deprecated alias for samples (removed in 1.2.0).

Returns:

df_feat – Feature DataFrame including feature value difference. If the feature value difference is computed for multiple samples, n=number of samples; n=1, otherwise.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+n)

Examples

To demonstrate the ShapModel().add_sample_mean_dif() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)
  entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
1 Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
2 Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
3 Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
4 P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
5 P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
6 P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

You need to provide X, labels, and df_feat to the ShapModel().add_sample_mean_dif() method, which will then compute the feature value difference for each sample against the reference group average:

sm = aa.ShapModel()

# Compute difference against average for negative (0) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
aa.display_df(df_feat)
  feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions feat_importance feat_importance_std mean_dif_Protein0 mean_dif_Protein1 mean_dif_Protein2 mean_dif_Protein3 mean_dif_Protein4 mean_dif_Protein5
1 TMD_C_JMD_C-Seg...3,4)-KLEP840101 Energy Charge Charge Net charge (Kle...n et al., 1984) 0.244000 0.103666 0.103666 0.106692 0.110506 0.000000 0.000000 31,32,33,34,35 0.970400 1.438918 0.100000 -0.100000 0.000000 0.200000 0.200000 0.200000
2 TMD_C_JMD_C-Seg...3,4)-FINA910104 Conformation α-helix (C-cap) α-helix termination Helix terminati...n et al., 1991) 0.243000 0.085064 0.085064 0.098774 0.096946 0.000000 0.000000 31,32,33,34,35 0.000000 0.000000 0.087600 -0.087600 0.000000 0.175200 0.175200 0.175200
3 TMD_C_JMD_C-Seg...6,9)-LEVM760105 Shape Side chain length Side chain length Radius of gyrat... (Levitt, 1976) 0.233000 0.137044 0.137044 0.161683 0.176964 0.000000 0.000001 32,33 1.554800 2.109848 0.123890 -0.360780 0.236890 0.282890 0.362890 0.338560
4 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.229000 0.098224 0.098224 0.106865 0.124608 0.000000 0.000001 31,32,33,34,35 3.111200 3.109955 0.131267 -0.269733 0.138467 0.231467 0.312467 0.277867
5 TMD_C_JMD_C-Seg...6,9)-RADA880106 ASA/Volume Volume Accessible surface area (ASA) Accessible surf...olfenden, 1988) 0.223000 0.095071 0.095071 0.114758 0.132829 0.000000 0.000002 32,33 0.000000 0.000000 0.067557 -0.230443 0.162887 0.165557 0.278227 0.208887

To change the reference group, use the label_ref parameter (default=0). Since df_feat already contains mean difference columns, we must set drop=True to remove them:

# Compute difference against average for positive (1) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, label_ref=1, df_feat=df_feat, drop=True)
aa.display_df(df_feat, n_cols=-6)
  mean_dif_Protein0 mean_dif_Protein1 mean_dif_Protein2 mean_dif_Protein3 mean_dif_Protein4 mean_dif_Protein5
1 -0.100000 -0.300000 -0.200000 -0.000000 -0.000000 -0.000000
2 -0.087600 -0.262800 -0.175200 -0.000000 -0.000000 -0.000000
3 -0.204223 -0.688893 -0.091223 -0.045223 0.034777 0.010447
4 -0.142667 -0.543667 -0.135467 -0.042467 0.038533 0.003933
5 -0.150000 -0.448000 -0.054670 -0.052000 0.060670 -0.008670
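As a quick sanity check on the two outputs above: because each column is a sample's feature values minus MEAN_REF, switching the reference group should shift every sample's difference for a given feature by the same constant (the gap between the two group means). The sketch below checks this for feature 3, using only the six displayed values from each output; the variable names are illustrative.

```python
import math

# Feature 3 (LEVM760105), values copied from the displayed outputs
dif_ref0 = [0.123890, -0.360780, 0.236890, 0.282890, 0.362890, 0.338560]    # label_ref=0
dif_ref1 = [-0.204223, -0.688893, -0.091223, -0.045223, 0.034777, 0.010447]  # label_ref=1

# (x - MEAN_REF0) - (x - MEAN_REF1) = MEAN_REF1 - MEAN_REF0, identical for every sample
shifts = [a - b for a, b in zip(dif_ref0, dif_ref1)]
constant_shift = all(math.isclose(s, shifts[0], abs_tol=1e-6) for s in shifts)

print(f"shift = {shifts[0]:.6f}, constant across samples: {constant_shift}")
```

The tolerance of 1e-6 accounts for the displayed values being rounded to six decimals.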

Select a specific sample by its row position in X (and labels) using the samples parameter. You can name it via the names parameter:

# Single sample
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)
  mean_dif_Selected_sample
1 0.100000
2 0.087600
3 0.123890
4 0.131267
5 0.067557

Samples can also be selected by accession instead of position. Provide df_seq and pass entry name(s) to samples; they are resolved to the matching row(s), and the difference column is named after the accession unless names is given:

# Select a sample by its entry/accession via df_seq
entry = df_seq["entry"].iloc[0]
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True,
                                 df_seq=df_seq, samples=entry)
aa.display_df(df_feat, n_cols=-1)
  mean_dif_Q14802
1 0.100000
2 0.087600
3 0.123890
4 0.131267
5 0.067557
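Conceptually, resolving entry names amounts to looking up each name's row position in the entry column. The following is a minimal sketch under that assumption, not the library's code: resolve_samples is a hypothetical helper, and the entries list stands in for df_seq["entry"].to_list() from the dataset shown earlier.

```python
def resolve_samples(samples, entries=None):
    """Map row positions and/or entry names to row positions (hypothetical helper)."""
    items = samples if isinstance(samples, (list, tuple)) else [samples]
    positions = []
    for item in items:
        if isinstance(item, str):
            if entries is None:
                raise ValueError("Entry names require the 'entry' column of df_seq")
            positions.append(entries.index(item))  # raises ValueError for unknown entries
        else:
            positions.append(int(item))
    return positions

# Entry column of the DOM_GSEC subset displayed above
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]

single = resolve_samples("Q14802", entries)               # first row
several = resolve_samples(["P05067", "Q86UE4"], entries)  # order follows the request
by_position = resolve_samples(0)                          # positions pass through unchanged
```

This is why selecting entry Q14802 yields the same values as samples=0: both refer to the first row of X.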

Three different scenarios are possible:

  1. Single sample: Compute the difference for a single sample (above).

  2. Multiple samples: Compute the difference for multiple samples (all by default).

  3. Group of samples: Compute the difference using the average of a group of samples.

To target specific samples, give their row positions in samples. If the names parameter is used, it must have the same length as samples:

# Multiple samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)
  mean_dif_Sample 1  mean_dif_Sample 2
1 0.100000           -0.100000
2 0.087600           -0.087600
3 0.123890           -0.360780
4 0.131267           -0.269733
5 0.067557           -0.230443

To compute the group average, set group_average=True and specify the sample indices in samples. Assign a name to the group using the names parameter; if not provided, ‘Group’ will be used as the default name:

# Group of samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-1)
  mean_dif_Group
1 0.000000
2 0.000000
3 -0.118445
4 -0.069233
5 -0.081443
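Since subtracting MEAN_REF is linear, the group difference should equal the average of the individual differences of its members. The short check below confirms this with the values displayed in the examples for samples 0 and 1; the variable names are illustrative only.

```python
import math

# Per-sample differences for samples 0 and 1 (label_ref=0), copied from the outputs above
sample_1 = [0.100000, 0.087600, 0.123890, 0.131267, 0.067557]
sample_2 = [-0.100000, -0.087600, -0.360780, -0.269733, -0.230443]

# Group difference reported for samples=[0, 1] with group_average=True
group = [0.000000, 0.000000, -0.118445, -0.069233, -0.081443]

# mean(x0, x1) - MEAN_REF == mean(x0 - MEAN_REF, x1 - MEAN_REF)
averaged = [(a + b) / 2 for a, b in zip(sample_1, sample_2)]
matches = all(math.isclose(m, g, abs_tol=1e-6) for m, g in zip(averaged, group))

print("group column equals the mean of the per-sample columns:", matches)
```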