ShapModel.add_sample_mean_dif
- static ShapModel.add_sample_mean_dif(X, labels, label_ref=0, *, df_feat, drop=False, samples=None, names=None, group_average=False, df_seq=None, sample_positions=None)
Compute the feature value difference between selected samples and a reference group average.
Three different scenarios for computing the difference with the reference group average (MEAN_REF) are possible:
Single sample: Computes the difference for a selected sample and MEAN_REF.
Multiple samples: Computes differences for multiple selected samples (all by default) individually against MEAN_REF.
Group of samples: Computes the difference between the average of a selected group of samples and MEAN_REF.
The calculated differences are added to df_feat as new columns named mean_dif_'name(s)', corresponding to each sample or group.
Added in version 0.1.0.
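The three scenarios can be illustrated with a minimal NumPy sketch. It assumes the difference is a plain per-feature subtraction of the reference group mean; the toy X and labels are hypothetical, and this is not the library's implementation:

```python
import numpy as np

# Hypothetical toy data: 4 samples x 2 features; the first two samples form the reference class
X = np.array([[0.2, 0.5],
              [0.4, 0.7],
              [0.9, 0.1],
              [0.5, 0.3]])
labels = np.array([0, 0, 1, 1])
label_ref = 0

# MEAN_REF: per-feature average over all samples of the reference class
mean_ref = X[labels == label_ref].mean(axis=0)

# Single sample: one difference vector of length n_features
dif_single = X[2] - mean_ref

# Multiple samples: one difference vector per selected sample (broadcast over rows)
dif_multi = X[[2, 3]] - mean_ref

# Group of samples: average the selected samples first, then subtract MEAN_REF
dif_group = X[[2, 3]].mean(axis=0) - mean_ref
```

Under this assumption, the group difference equals the average of the individual differences of its members, since subtracting a constant vector commutes with averaging.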
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
label_ref (int, default=0) – Class label of reference group in labels.
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
drop (bool, default=False) – If True, allow dropping of already existing sample-specific mean difference columns from df_feat before inserting.
samples (int, list of int, str, list of str, or None) – Sample(s) to compute the difference for, given either as row position(s) in X or as entry name(s) (str) from the entry column of df_seq (resolved to the matching row(s)). If None, the difference is computed for each sample.
names (str or list of str, optional) – Unique name(s) used for the feature value difference columns. When provided, they should align with samples as follows:
Single sample: names as string and samples as integer or entry name.
Multiple samples: names as list of strings and samples as corresponding list.
Group: names as string and samples as list, each element indicating a sample of the group.
If samples is None (all samples are considered), names, when provided, must be a list with one name per sample. When samples is given as entry name(s) and names is None, the entry name(s) are used.
group_average (bool, default=False) – If True, compute the average of the samples given by samples.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to X. Required only when samples is given as entry name(s).
sample_positions (int, list of int, str, list of str, or None) – Deprecated alias for samples (removed in 1.2.0).
- Returns:
df_feat – Feature DataFrame including feature value difference. If the feature value difference is computed for multiple samples, n=number of samples; n=1, otherwise.
- Return type:
pd.DataFrame, shape (n_features, n_feature_info+n)
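The resolution of entry names to row positions described for the samples parameter can be sketched with pandas. The helper resolve_positions below is hypothetical (it is not part of aaanalysis) and assumes that df_seq is row-aligned to X and that its entry column holds unique identifiers:

```python
import pandas as pd


def resolve_positions(samples, df_seq=None):
    """Hypothetical helper (not part of aaanalysis): map samples to row positions in X."""
    # Accept a single int/str as well as a list of them
    items = [samples] if isinstance(samples, (int, str)) else list(samples)
    entries = None if df_seq is None else df_seq["entry"].tolist()
    positions = []
    for item in items:
        if isinstance(item, str):
            # Entry names can only be resolved through df_seq
            if entries is None:
                raise ValueError("'df_seq' is required when 'samples' contains entry names")
            if item not in entries:
                raise ValueError(f"Entry '{item}' not found in df_seq['entry']")
            positions.append(entries.index(item))
        else:
            # Integers are taken as row positions directly
            positions.append(int(item))
    return positions


# Toy df_seq using accessions from the example dataset
df_seq = pd.DataFrame({"entry": ["Q14802", "Q86UE4", "Q969W9", "P05067"]})
pos_single = resolve_positions("P05067", df_seq=df_seq)
pos_by_entry = resolve_positions(["Q14802", "P05067"], df_seq=df_seq)
pos_by_index = resolve_positions([0, 1])
```

Looking entries up by list position rather than by DataFrame index label keeps the result aligned with the rows of X even if df_seq carries a non-default index.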
Examples
To demonstrate the ShapModel().add_sample_mean_dif() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

    import aaanalysis as aa
    aa.options["verbose"] = False  # Disable verbosity
    df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(5)

    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
    aa.display_df(df_seq)
| | entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c |
|---|---|---|---|---|---|---|---|---|
| 1 | Q14802 | MQKVTLGLLVFLAGF...PGETPPLITPGSAQS | 0 | 37 | 59 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS |
| 2 | Q86UE4 | MAARSWQDELAQQAE...SPKQIKKKKKARRET | 0 | 50 | 72 | LGLEPKRYPG | WVILVGTGALGLLLLFLLGYGWA | AACAGARKKR |
| 3 | Q969W9 | MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL | 0 | 41 | 63 | FQSMEITELE | FVQIIIIVVVMMVMVVVITCLLS | HYKLSARSFI |
| 4 | P05067 | MLPGLALLLLAAWTA...GYENPTYKFFEQMQN | 1 | 701 | 723 | FAEDVGSNKG | AIIGLMVGGVVIATVIVITLVML | KKKQYTSIHH |
| 5 | P14925 | MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS | 1 | 868 | 890 | KLSTEPGSGV | SVVLITTLLVIPVLVLLAIVMFI | RWKKSRAFGD |
| 6 | P70180 | MRSLLLFTFSACVLL...RELREDSIRSHFSVA | 1 | 477 | 499 | PCKSSGGLEE | SAVTGIVVGALLGAGLLMAFYFF | RKKYRITIER |

You need to provide X, labels, and df_feat to the ShapModel().add_sample_mean_dif() method, which will then compute the feature value difference for each sample against the reference group average:

    sm = aa.ShapModel()
    # Compute difference against average for negative (0) group
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
    aa.display_df(df_feat)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std | mean_dif_Protein0 | mean_dif_Protein1 | mean_dif_Protein2 | mean_dif_Protein3 | mean_dif_Protein4 | mean_dif_Protein5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.970400 | 1.438918 | 0.100000 | -0.100000 | 0.000000 | 0.200000 | 0.200000 | 0.200000 |
| 2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.000000 | 0.000000 | 0.087600 | -0.087600 | 0.000000 | 0.175200 | 0.175200 | 0.175200 |
| 3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 1.554800 | 2.109848 | 0.123890 | -0.360780 | 0.236890 | 0.282890 | 0.362890 | 0.338560 |
| 4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955 | 0.131267 | -0.269733 | 0.138467 | 0.231467 | 0.312467 | 0.277867 |
| 5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 0.000000 | 0.000000 | 0.067557 | -0.230443 | 0.162887 | 0.165557 | 0.278227 | 0.208887 |

To change the reference group, use the label_ref parameter (default=0). Since df_feat already contains mean difference columns, we must set drop=True to remove them:

    # Compute difference against average for positive (1) group
    df_feat = sm.add_sample_mean_dif(X, labels=labels, label_ref=1, df_feat=df_feat, drop=True)
    aa.display_df(df_feat, n_cols=-6)
| | mean_dif_Protein0 | mean_dif_Protein1 | mean_dif_Protein2 | mean_dif_Protein3 | mean_dif_Protein4 | mean_dif_Protein5 |
|---|---|---|---|---|---|---|
| 1 | -0.100000 | -0.300000 | -0.200000 | -0.000000 | -0.000000 | -0.000000 |
| 2 | -0.087600 | -0.262800 | -0.175200 | -0.000000 | -0.000000 | -0.000000 |
| 3 | -0.204223 | -0.688893 | -0.091223 | -0.045223 | 0.034777 | 0.010447 |
| 4 | -0.142667 | -0.543667 | -0.135467 | -0.042467 | 0.038533 | 0.003933 |
| 5 | -0.150000 | -0.448000 | -0.054670 | -0.052000 | 0.060670 | -0.008670 |

Select a specific sample based on its position index in labels using the samples parameter. You can provide its name via the names parameter:

    # Single sample
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=0, names="Selected_sample")
    aa.display_df(df_feat, n_cols=-1)
| | mean_dif_Selected_sample |
|---|---|
| 1 | 0.100000 |
| 2 | 0.087600 |
| 3 | 0.123890 |
| 4 | 0.131267 |
| 5 | 0.067557 |

Samples can also be selected by accession instead of position. Provide df_seq and pass entry name(s) to samples; they are resolved to the matching row(s), and the difference column is named after the accession unless names is given:

    # Select a sample by its entry/accession via df_seq
    entry = df_seq["entry"].iloc[0]
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, df_seq=df_seq, samples=entry)
    aa.display_df(df_feat, n_cols=-1)
| | mean_dif_Q14802 |
|---|---|
| 1 | 0.100000 |
| 2 | 0.087600 |
| 3 | 0.123890 |
| 4 | 0.131267 |
| 5 | 0.067557 |

Three different scenarios are possible:
Single sample: Compute the difference for a single sample (above).
Multiple samples: Compute the difference for multiple samples (all by default).
Group of samples: Compute the difference using the average of a group of samples.
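How samples, names, and group_average map to output column names in these three scenarios can be sketched as follows. The helper build_column_names is hypothetical and only mirrors the naming seen in this page's outputs (the 'Protein{i}' default for positions and 'Group' for a group); the library's actual validation may differ:

```python
def build_column_names(samples, names=None, group_average=False):
    """Hypothetical helper (not part of aaanalysis): derive 'mean_dif_*' column names."""
    if group_average:
        # Group of samples: a single column, named by one string ('Group' by default)
        if names is not None and not isinstance(names, str):
            raise ValueError("'names' should be a single string for a group")
        group_name = "Group" if names is None else names
        return [f"mean_dif_{group_name}"]
    if isinstance(samples, int):
        # Single sample: a single column, named by one string
        if names is not None and not isinstance(names, str):
            raise ValueError("'names' should be a single string for a single sample")
        sample_name = f"Protein{samples}" if names is None else names
        return [f"mean_dif_{sample_name}"]
    # Multiple samples: one column per sample, so 'names' must match 'samples' in length
    if names is None:
        names = [f"Protein{i}" for i in samples]
    if isinstance(names, str) or len(names) != len(samples):
        raise ValueError("'names' should be a list with one name per sample")
    return [f"mean_dif_{name}" for name in names]


cols_single = build_column_names(0, names="Selected_sample")
cols_multi = build_column_names([0, 1], names=["Sample 1", "Sample 2"])
cols_group = build_column_names([0, 1], group_average=True)
```

The same list passed as samples yields two columns or one, depending only on group_average, which is why the expected type of names changes between the multiple-sample and group scenarios.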
To target specific samples, define their indices in samples. Ensure the names parameter, if used, corresponds in length to samples:

    # Multiple samples
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=[0, 1], names=["Sample 1", "Sample 2"])
    aa.display_df(df_feat, n_cols=-2)
| | mean_dif_Sample 1 | mean_dif_Sample 2 |
|---|---|---|
| 1 | 0.100000 | -0.100000 |
| 2 | 0.087600 | -0.087600 |
| 3 | 0.123890 | -0.360780 |
| 4 | 0.131267 | -0.269733 |
| 5 | 0.067557 | -0.230443 |

To compute the group average, set group_average=True and specify the sample indices in samples. Assign a name to the group using the names parameter; if not provided, 'Group' will be used as the default name:

    # Group of samples
    df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, samples=[0, 1], group_average=True)
    aa.display_df(df_feat, n_cols=-1)
| | mean_dif_Group |
|---|---|
| 1 | 0.000000 |
| 2 | 0.000000 |
| 3 | -0.118445 |
| 4 | -0.069233 |
| 5 | -0.081443 |
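The outputs shown in these examples can be cross-checked against each other. The NumPy sketch below copies the displayed (rounded) values and assumes that the differences are plain subtractions of the reference group mean; the tolerance atol is an arbitrary choice to absorb the six-decimal rounding:

```python
import numpy as np

# Values copied from the outputs above (rows = features 1-5, columns = Protein0..Protein5)
dif_ref0 = np.array([  # label_ref=0
    [0.100000, -0.100000, 0.000000, 0.200000, 0.200000, 0.200000],
    [0.087600, -0.087600, 0.000000, 0.175200, 0.175200, 0.175200],
    [0.123890, -0.360780, 0.236890, 0.282890, 0.362890, 0.338560],
    [0.131267, -0.269733, 0.138467, 0.231467, 0.312467, 0.277867],
    [0.067557, -0.230443, 0.162887, 0.165557, 0.278227, 0.208887],
])
dif_ref1 = np.array([  # label_ref=1
    [-0.100000, -0.300000, -0.200000, 0.000000, 0.000000, 0.000000],
    [-0.087600, -0.262800, -0.175200, 0.000000, 0.000000, 0.000000],
    [-0.204223, -0.688893, -0.091223, -0.045223, 0.034777, 0.010447],
    [-0.142667, -0.543667, -0.135467, -0.042467, 0.038533, 0.003933],
    [-0.150000, -0.448000, -0.054670, -0.052000, 0.060670, -0.008670],
])
dif_group = np.array([0.000000, 0.000000, -0.118445, -0.069233, -0.081443])

atol = 1e-5  # tolerance for values rounded to six decimals

# Check 1: the group difference for samples [0, 1] is the mean of their individual differences
group_ok = np.allclose(dif_ref0[:, [0, 1]].mean(axis=1), dif_group, atol=atol)

# Check 2: changing the reference group shifts all samples by one per-feature offset,
# namely the mean difference of the positive samples (columns 3-5) against the negative mean
shift = dif_ref0[:, 3:].mean(axis=1)
shift_ok = np.allclose(dif_ref0 - shift[:, None], dif_ref1, atol=atol)

# Check 3: differences of the reference samples themselves (columns 0-2) average to zero
ref_ok = np.allclose(dif_ref0[:, :3].mean(axis=1), 0.0, atol=atol)

print(group_ok, shift_ok, ref_ok)
```

All three checks hold for the displayed values, which is consistent with the description of the method but does not by itself confirm how the library computes them internally.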