aaanalysis.ShapModel.add_sample_mean_dif

static ShapModel.add_sample_mean_dif(X, labels=None, label_ref=0, df_feat=None, drop=False, sample_positions=None, names=None, group_average=False)

Compute the feature value difference between selected samples and a reference group average.

Three different scenarios for computing the difference with the reference group average (MEAN_REF) are possible:

  1. Single sample: Computes the difference for a selected sample and MEAN_REF.

  2. Multiple samples: Computes differences for multiple selected samples (all by default) individually against MEAN_REF.

  3. Group of samples: Computes the difference between the average of a selected group of samples and MEAN_REF.

The calculated differences are added to df_feat as new columns named mean_dif_'name(s)', corresponding to each sample or group.
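The three scenarios can be sketched with plain NumPy. This is a minimal illustration on a toy matrix, assuming the difference is taken as sample value minus MEAN_REF (the sign convention suggested by the example outputs further down); it is not the library's implementation, and the toy values and variable names are made up for this sketch.

```python
import numpy as np

# Toy feature matrix: 4 samples x 2 features (illustrative values only)
X = np.array([[0.2, 0.5],
              [0.4, 0.1],
              [0.9, 0.7],
              [0.7, 0.9]])
labels = np.array([0, 0, 1, 1])
label_ref = 0

# MEAN_REF: per-feature average over the reference group
mean_ref = X[labels == label_ref].mean(axis=0)

# 1. Single sample: one row minus MEAN_REF
dif_single = X[2] - mean_ref

# 2. Multiple samples: every row individually minus MEAN_REF (broadcasting)
dif_multi = X - mean_ref

# 3. Group of samples: average the selected rows first, then subtract MEAN_REF
dif_group = X[[2, 3]].mean(axis=0) - mean_ref
```

In this sketch each result has one value per feature (or one row of values per sample), which is why the method can append them to df_feat as one new column per sample or group.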

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • labels (array-like, shape (n_samples,)) – Class labels for samples in X (typically, 1=positive, 0=negative).

  • label_ref (int, default=0) – Class label of reference group in labels.

  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • drop (bool, default=False) – If True, allow dropping of already existing sample specific mean difference columns from df_feat before inserting.

  • sample_positions (int, list of int, or None) – Position index/indices for the sample(s) in X. If None, the difference is computed for each sample.

  • names (str or list of str, optional) –

    Unique name(s) used for the feature value differences columns. When provided, they should align with sample_positions as follows:

    • Single sample: name as string and sample_positions as integer.

    • Multiple samples: name as list of string and sample_positions as corresponding list of integers.

    • Group: name as string and sample_positions as list of integers, each indicating a group sample.

    If sample_positions is None (all samples are considered), names must be a list with one name for each sample.

  • group_average (bool, default=False) – If True, compute the average of samples given by sample_positions.

Returns:

df_feat – Feature DataFrame including the feature value difference columns. If the difference is computed for multiple samples individually, n is the number of samples; otherwise, n=1.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+n)

Examples

To demonstrate the ShapModel().add_sample_mean_dif() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 Q14802 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
2 Q86UE4 MAARSWQDELAQQAE...SPKQIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR
3 Q969W9 MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL 0 41 63 FQSMEITELE FVQIIIIVVVMMVMVVVITCLLS HYKLSARSFI
4 P05067 MLPGLALLLLAAWTA...GYENPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
5 P14925 MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
6 P70180 MRSLLLFTFSACVLL...RELREDSIRSHFSVA 1 477 499 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER

You need to provide X, labels, and df_feat to the ShapModel().add_sample_mean_dif() method, which will then compute the feature value difference for each sample against the reference group average:

sm = aa.ShapModel()

# Compute difference against average for negative (0) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
aa.display_df(df_feat)
  feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions feat_importance feat_importance_std mean_dif_Protein0 mean_dif_Protein1 mean_dif_Protein2 mean_dif_Protein3 mean_dif_Protein4 mean_dif_Protein5
1 TMD_C_JMD_C-Seg...3,4)-KLEP840101 Energy Charge Charge Net charge (Kle...n et al., 1984) 0.244000 0.103666 0.103666 0.106692 0.110506 0.000000 0.000000 31,32,33,34,35 0.970400 1.438918 0.100000 -0.100000 0.000000 0.200000 0.200000 0.200000
2 TMD_C_JMD_C-Seg...3,4)-FINA910104 Conformation α-helix (C-cap) α-helix termination Helix terminati...n et al., 1991) 0.243000 0.085064 0.085064 0.098774 0.096946 0.000000 0.000000 31,32,33,34,35 0.000000 0.000000 0.087600 -0.087600 0.000000 0.175200 0.175200 0.175200
3 TMD_C_JMD_C-Seg...6,9)-LEVM760105 Shape Side chain length Side chain length Radius of gyrat... (Levitt, 1976) 0.233000 0.137044 0.137044 0.161683 0.176964 0.000000 0.000001 32,33 1.554800 2.109848 0.123890 -0.360780 0.236890 0.282890 0.362890 0.338560
4 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.229000 0.098224 0.098224 0.106865 0.124608 0.000000 0.000001 31,32,33,34,35 3.111200 3.109955 0.131267 -0.269733 0.138467 0.231467 0.312467 0.277867
5 TMD_C_JMD_C-Seg...6,9)-RADA880106 ASA/Volume Volume Accessible surface area (ASA) Accessible surf...olfenden, 1988) 0.223000 0.095071 0.095071 0.114758 0.132829 0.000000 0.000002 32,33 0.000000 0.000000 0.067557 -0.230443 0.162887 0.165557 0.278227 0.208887

To change the reference group, use the label_ref parameter (default=0). Since df_feat already contains mean difference columns, we must set drop=True to remove them:

# Compute difference against average for positive (1) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, label_ref=1, df_feat=df_feat, drop=True)
aa.display_df(df_feat, n_cols=-6)
  mean_dif_Protein0 mean_dif_Protein1 mean_dif_Protein2 mean_dif_Protein3 mean_dif_Protein4 mean_dif_Protein5
1 -0.100000 -0.300000 -0.200000 -0.000000 -0.000000 -0.000000
2 -0.087600 -0.262800 -0.175200 -0.000000 -0.000000 -0.000000
3 -0.204223 -0.688893 -0.091223 -0.045223 0.034777 0.010447
4 -0.142667 -0.543667 -0.135467 -0.042467 0.038533 0.003933
5 -0.150000 -0.448000 -0.054670 -0.052000 0.060670 -0.008670
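The two outputs above are linked by a constant per-feature shift: switching the reference group subtracts the gap between the two group averages from every sample's difference. The following check reproduces the first row of the label_ref=1 output from the first row of the label_ref=0 output. It is a plain-Python sketch using the rounded values printed above, and the variable names are chosen only for this illustration.

```python
# Row 1 of the label_ref=0 output (Protein0..Protein5), copied from the table above
dif_ref0 = [0.1, -0.1, 0.0, 0.2, 0.2, 0.2]
labels = [0, 0, 0, 1, 1, 1]

# Average difference of the positive group relative to the negative-group average
positive = [d for d, label in zip(dif_ref0, labels) if label == 1]
shift = sum(positive) / len(positive)

# Re-referencing to the positive group subtracts that shift from every sample
dif_ref1 = [round(d - shift, 6) for d in dif_ref0]
print(dif_ref1)
```

Samples belonging to the new reference group end up at or near zero for features on which that group is homogeneous, as seen for Protein3 to Protein5 in rows 1 and 2.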

Select a specific sample based on its position index in labels using the sample_positions parameter. You can provide its name via the names parameter:

# Single sample
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, sample_positions=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)
  mean_dif_Selected_sample
1 0.100000
2 0.087600
3 0.123890
4 0.131267
5 0.067557

Three different scenarios are possible:

  1. Single sample: Compute the difference for a single sample (above).

  2. Multiple samples: Compute the difference for multiple samples (all by default).

  3. Group of samples: Compute the difference using the average of a group of samples.

To target specific samples, define their indices in sample_positions. Ensure the names parameter, if used, corresponds in length to sample_positions.

# Multiple samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, sample_positions=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)
  mean_dif_Sample 1 mean_dif_Sample 2
1 0.100000 -0.100000
2 0.087600 -0.087600
3 0.123890 -0.360780
4 0.131267 -0.269733
5 0.067557 -0.230443
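The pairing between sample_positions and names can be sketched as a small helper that derives the resulting column names. The function column_names and its n_samples argument are hypothetical and not part of aaanalysis; the 'Protein<i>' default is inferred from the all-samples output above and is only assumed to apply to a subset of samples, while 'Group' is the documented default for a group.

```python
def column_names(sample_positions, names=None, group_average=False, n_samples=None):
    """Hypothetical helper: derive the 'mean_dif_<name>' column names."""
    if group_average:
        # One column for the whole group; 'Group' is the documented default name
        name = "Group" if names is None else names
        if not isinstance(name, str):
            raise ValueError("A group needs a single name (str).")
        return [f"mean_dif_{name}"]

    # Normalize sample_positions to a list of integer positions
    if sample_positions is None:
        if n_samples is None:
            raise ValueError("n_samples is required when sample_positions is None.")
        positions = list(range(n_samples))
    elif isinstance(sample_positions, int):
        positions = [sample_positions]
    else:
        positions = list(sample_positions)

    # Normalize names to a list with one entry per position
    if names is None:
        # Assumed default pattern, as seen in the all-samples output
        names = [f"Protein{i}" for i in positions]
    elif isinstance(names, str):
        names = [names]
    if len(names) != len(positions):
        raise ValueError("names must match sample_positions in length.")
    if len(set(names)) != len(names):
        raise ValueError("names must be unique.")
    return [f"mean_dif_{name}" for name in names]


print(column_names(0, "Selected_sample"))
print(column_names([0, 1], ["Sample 1", "Sample 2"]))
```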

To compute the group average, set group_average=True and specify the sample indices in sample_positions. Assign a name to the group using the names parameter; if not provided, 'Group' will be used as the default name:

# Group of samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, sample_positions=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-1)
  mean_dif_Group
1 0.000000
2 0.000000
3 -0.118445
4 -0.069233
5 -0.081443
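Because averaging is linear, the group difference equals the mean of the individual sample differences against the same reference. The check below recovers the mean_dif_Group column from the two per-sample columns shown in the "Multiple samples" output. It is a plain-Python sketch on the rounded printed values, not library code.

```python
# Per-sample differences for positions 0 and 1 (rows 1-5), copied from the
# "Multiple samples" output above
dif_sample_1 = [0.100000, 0.087600, 0.123890, 0.131267, 0.067557]
dif_sample_2 = [-0.100000, -0.087600, -0.360780, -0.269733, -0.230443]

# Mean of the individual differences, feature by feature
dif_group = [round((a + b) / 2, 6) for a, b in zip(dif_sample_1, dif_sample_2)]
print(dif_group)
```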