aaanalysis.ShapModel.add_sample_mean_dif
- static ShapModel.add_sample_mean_dif(X, labels=None, label_ref=0, df_feat=None, drop=False, sample_positions=None, names=None, group_average=False)
Compute the feature value difference between selected samples and a reference group average.
Three different scenarios for computing the difference with the reference group average (MEAN_REF) are possible:
Single sample: Computes the difference for a selected sample and MEAN_REF.
Multiple samples: Computes differences for multiple selected samples (all by default) individually against MEAN_REF.
Group of samples: Computes the difference between the average of a selected group of samples and MEAN_REF.
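The three scenarios can be sketched in plain Python. This is a minimal, dependency-free illustration and not the library's implementation: the helper names `column_means` and `sample_mean_dif` and the toy data are assumptions made for this example, and the result is returned as one list of per-feature differences per sample (or group) rather than as DataFrame columns.

```python
from statistics import fmean


def column_means(rows):
    """Return the per-feature mean over a list of samples (rows)."""
    return [fmean(col) for col in zip(*rows)]


def sample_mean_dif(X, labels, label_ref=0, sample_positions=None, group_average=False):
    """Hypothetical helper mirroring the three scenarios; not the library code."""
    # MEAN_REF: per-feature average over all samples of the reference class
    mean_ref = column_means([x for x, y in zip(X, labels) if y == label_ref])
    if sample_positions is None:
        sample_positions = list(range(len(X)))   # multiple samples: all by default
    elif isinstance(sample_positions, int):
        sample_positions = [sample_positions]    # single sample
    selected = [X[i] for i in sample_positions]
    if group_average:
        # Group of samples: average the group first, then subtract MEAN_REF
        selected = [column_means(selected)]
    # One list of per-feature differences per sample (or per group)
    return [[v - m for v, m in zip(row, mean_ref)] for row in selected]


# Toy data (assumed): 4 samples x 2 features; the first two samples are the reference class
X = [[1.0, 4.0], [3.0, 8.0], [5.0, 6.0], [7.0, 2.0]]
labels = [0, 0, 1, 1]

single = sample_mean_dif(X, labels, sample_positions=2)
multiple = sample_mean_dif(X, labels)
group = sample_mean_dif(X, labels, sample_positions=[2, 3], group_average=True)
```

In this toy setup MEAN_REF is `[2.0, 6.0]`, so every result is the selected feature values minus those two numbers.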
The calculated differences are added to df_feat as new columns named mean_dif_'name(s)', corresponding to each sample or group.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
label_ref (int, default=0) – Class label of the reference group in labels.
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
drop (bool, default=False) – If True, allow dropping of already existing sample-specific mean difference columns from df_feat before inserting.
sample_positions (int, list of int, or None) – Position index/indices of the sample(s) in X. If None, the difference is computed for each sample.
names (str or list of str, optional) – Unique name(s) used for the feature value difference columns. When provided, they should align with sample_positions as follows:
Single sample: names as string and sample_positions as integer.
Multiple samples: names as list of strings and sample_positions as corresponding list of integers.
Group: names as string and sample_positions as list of integers, each indicating a group sample.
If sample_positions is None (all samples are considered), names, if given, must be a list with one name for each sample.
group_average (bool, default=False) – If True, compute the average of the samples given by sample_positions.
- Returns:
df_feat – Feature DataFrame including the feature value difference column(s). In the return shape, n is the number of samples if the difference is computed for multiple samples, and n=1 otherwise.
- Return type:
pd.DataFrame, shape (n_features, n_feature_info+n)
Examples
To demonstrate the ShapModel().add_sample_mean_dif() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False # Disable verbosity

    df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(5)

    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
    aa.display_df(df_seq)

       entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
    1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
    2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
    3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
    4  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
    5  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
    6  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

You need to provide X, labels, and df_feat to the ShapModel().add_sample_mean_dif() method, which will then compute the feature value difference for each sample against the reference group average:

    sm = aa.ShapModel()
    # Compute difference against average for negative (0) group
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
    aa.display_df(df_feat)

       feature                            category      subcategory        scale_name                     scale_description
    1  TMD_C_JMD_C-Seg...3,4)-KLEP840101  Energy        Charge             Charge                         Net charge (Kle...n et al., 1984)
    2  TMD_C_JMD_C-Seg...3,4)-FINA910104  Conformation  α-helix (C-cap)    α-helix termination            Helix terminati...n et al., 1991)
    3  TMD_C_JMD_C-Seg...6,9)-LEVM760105  Shape         Side chain length  Side chain length              Radius of gyrat... (Levitt, 1976)
    4  TMD_C_JMD_C-Seg...3,4)-HUTJ700102  Energy        Entropy            Entropy                        Absolute entrop...Hutchens, 1970)
    5  TMD_C_JMD_C-Seg...6,9)-RADA880106  ASA/Volume    Volume             Accessible surface area (ASA)  Accessible surf...olfenden, 1988)

    (columns continued)
       abs_auc   abs_mean_dif  mean_dif  std_test  std_ref   p_val_mann_whitney  p_val_fdr_bh  positions       feat_importance  feat_importance_std
    1  0.244000  0.103666      0.103666  0.106692  0.110506  0.000000            0.000000      31,32,33,34,35  0.970400         1.438918
    2  0.243000  0.085064      0.085064  0.098774  0.096946  0.000000            0.000000      31,32,33,34,35  0.000000         0.000000
    3  0.233000  0.137044      0.137044  0.161683  0.176964  0.000000            0.000001      32,33           1.554800         2.109848
    4  0.229000  0.098224      0.098224  0.106865  0.124608  0.000000            0.000001      31,32,33,34,35  3.111200         3.109955
    5  0.223000  0.095071      0.095071  0.114758  0.132829  0.000000            0.000002      32,33           0.000000         0.000000

    (columns continued)
       mean_dif_Protein0  mean_dif_Protein1  mean_dif_Protein2  mean_dif_Protein3  mean_dif_Protein4  mean_dif_Protein5
    1  0.100000           -0.100000          0.000000           0.200000           0.200000           0.200000
    2  0.087600           -0.087600          0.000000           0.175200           0.175200           0.175200
    3  0.123890           -0.360780          0.236890           0.282890           0.362890           0.338560
    4  0.131267           -0.269733          0.138467           0.231467           0.312467           0.277867
    5  0.067557           -0.230443          0.162887           0.165557           0.278227           0.208887

To change the reference group, use the label_ref parameter (default=0). Since df_feat already contains mean difference columns, we must set drop=True to remove them:

    # Compute difference against average for positive (1) group
    df_feat = sm.add_sample_mean_dif(X, labels=labels, label_ref=1, df_feat=df_feat, drop=True)
    aa.display_df(df_feat, n_cols=-6)

       mean_dif_Protein0  mean_dif_Protein1  mean_dif_Protein2  mean_dif_Protein3  mean_dif_Protein4  mean_dif_Protein5
    1  -0.100000          -0.300000          -0.200000          -0.000000          -0.000000          -0.000000
    2  -0.087600          -0.262800          -0.175200          -0.000000          -0.000000          -0.000000
    3  -0.204223          -0.688893          -0.091223          -0.045223          0.034777           0.010447
    4  -0.142667          -0.543667          -0.135467          -0.042467          0.038533           0.003933
    5  -0.150000          -0.448000          -0.054670          -0.052000          0.060670           -0.008670

Select a specific sample based on its position index in labels using the sample_positions parameter. You can provide its name via the names parameter:

    # Single sample
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True,
                                     sample_positions=0, names="Selected_sample")
    aa.display_df(df_feat, n_cols=-1)

       mean_dif_Selected_sample
    1  0.100000
    2  0.087600
    3  0.123890
    4  0.131267
    5  0.067557

Three different scenarios are possible:
Single sample: Compute the difference for a single sample (above).
Multiple samples: Compute the difference for multiple samples (all by default).
Group of samples: Compute the difference using the average of a group of samples.
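For these three scenarios, the pairing of names and sample_positions described in the parameter list can be sketched as follows. The helper `resolve_columns` and its `n_samples` argument are hypothetical and exist only for this illustration. The 'Group' default comes from the documentation; the 'Protein<i>' default is an assumption modeled on the column names in the output shown earlier, and the library's own validation may differ.

```python
def resolve_columns(sample_positions=None, names=None, group_average=False, n_samples=None):
    """Hypothetical helper: map 'mean_dif_<name>' column names to sample positions."""
    if sample_positions is None:
        if n_samples is None:
            raise ValueError("n_samples is required when sample_positions is None.")
        positions = list(range(n_samples))       # all samples
    elif isinstance(sample_positions, int):
        positions = [sample_positions]           # single sample
    else:
        positions = list(sample_positions)       # multiple samples or group

    if group_average:
        # Group: one string name for a list of positions ('Group' if none is given)
        name = "Group" if names is None else names
        if not isinstance(name, str):
            raise ValueError("A group needs a single name given as a string.")
        return {f"mean_dif_{name}": positions}

    if names is None:
        # Assumed default, modeled on the 'mean_dif_Protein<i>' columns shown earlier
        names = [f"Protein{i}" for i in positions]
    elif isinstance(names, str):
        names = [names]
    if len(names) != len(positions):
        raise ValueError("'names' and 'sample_positions' must have the same length.")
    if len(set(names)) != len(names):
        raise ValueError("'names' must be unique.")
    return {f"mean_dif_{n}": [p] for n, p in zip(names, positions)}


single_cols = resolve_columns(0, names="Selected_sample")
multi_cols = resolve_columns([0, 1], names=["Sample 1", "Sample 2"])
group_cols = resolve_columns([0, 1], group_average=True)
all_cols = resolve_columns(None, n_samples=3)
```

Each key is one new column, and each value lists the sample positions that feed into it, so a group yields one column built from several samples.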
To target specific samples, define their indices in sample_positions. Ensure that the names parameter, if used, matches sample_positions in length:

    # Multiple samples
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True,
                                     sample_positions=[0, 1], names=["Sample 1", "Sample 2"])
    aa.display_df(df_feat, n_cols=-2)

       mean_dif_Sample 1  mean_dif_Sample 2
    1  0.100000           -0.100000
    2  0.087600           -0.087600
    3  0.123890           -0.360780
    4  0.131267           -0.269733
    5  0.067557           -0.230443

To compute the group average, set group_average=True and specify the sample indices in sample_positions. Assign a name to the group using the names parameter; if not provided, 'Group' will be used as the default name:

    # Group of samples
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True,
                                     sample_positions=[0, 1], group_average=True)
    aa.display_df(df_feat, n_cols=-1)

       mean_dif_Group
    1  0.000000
    2  0.000000
    3  -0.118445
    4  -0.069233
    5  -0.081443
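
Because MEAN_REF is the same for every sample, the difference of a group average equals the average of the individual differences. The short check below recomputes the group column from the two per-sample columns; the numbers are copied from the 'Sample 1', 'Sample 2', and 'Group' outputs above, and the variable names are illustrative only.

```python
# Per-feature differences copied from the 'mean_dif_Sample 1' / 'mean_dif_Sample 2' output
sample_1 = [0.100000, 0.087600, 0.123890, 0.131267, 0.067557]
sample_2 = [-0.100000, -0.087600, -0.360780, -0.269733, -0.230443]
# Values copied from the 'mean_dif_Group' output
group_reported = [0.000000, 0.000000, -0.118445, -0.069233, -0.081443]

# Average the two individual differences feature by feature
group_recomputed = [(a + b) / 2 for a, b in zip(sample_1, sample_2)]

# Compare with the reported group column, allowing for display rounding
max_abs_error = max(abs(r - g) for r, g in zip(group_recomputed, group_reported))
matches = max_abs_error < 1e-6
```

Here `matches` is true, so the group column agrees with the per-sample columns within the six-decimal display precision.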