SequenceFeature.get_seq_kws

SequenceFeature.get_seq_kws(df_seq=None, df_parts=None, sample=None)

Get the per-part sequence keyword arguments (jmd_n_seq, tmd_seq, jmd_c_seq) for one protein.

Returns the TMD-JMD sequence parts of a single protein as a ready-to-use seq_kws dictionary that can be passed directly to CPPPlot methods (e.g. for sample-level plots) via **seq_kws, without manually slicing the DataFrame. The parts are taken from df_parts (the same sequence parts that produced df_feat via CPP.run()), so the displayed residues always match the feature geometry; df_seq is cross-checked for consistency. There is no JMD length argument: the JMD lengths are determined by df_parts, and a JMD part that df_parts does not contain is returned as an empty string.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and sequence information in a distinct format: Position-based, Part-based, Sequence-based, or Sequence-TMD-based.

  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame (indexed by entry) as produced by SequenceFeature.get_df_parts() and passed to CPP.run(). Defines the TMD-JMD geometry; must be consistent with df_seq.

  • sample (int or str) – The protein to extract, given either as a zero-based row position (int) in df_parts or as an entry name (str) from its index.

Returns:

seq_kws – Dictionary with the keys jmd_n_seq, tmd_seq, and jmd_c_seq mapping to the corresponding amino acid sequence parts of the selected protein (empty string where a JMD part is not encoded in df_parts).

Return type:

dict

Examples

The SequenceFeature().get_seq_kws() method returns the JMD-N, TMD, and JMD-C sequence parts of a single protein as a ready-to-use seq_kws dictionary. The parts are taken from df_parts (the same parts passed to CPP.run()), so the displayed residues stay bound to the feature geometry; df_seq is cross-checked for consistency. This is the per-protein sequence input for sample-level plots such as CPPPlot.profile() and CPPPlot.feature_map() (e.g. for SHAP-based explanations), passed via **seq_kws.

We first load the DOM_GSEC example dataset and build its sequence parts:

import pandas as pd
import aaanalysis as aa
aa.options["verbose"] = False  # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df_seq, n_rows=10, show_shape=True)
DataFrame shape: (6, 8)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 Q14802 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
2 Q86UE4 MAARSWQDELAQQAE...SPKQIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR
3 Q969W9 MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL 0 41 63 FQSMEITELE FVQIIIIVVVMMVMVVVITCLLS HYKLSARSFI
4 P05067 MLPGLALLLLAAWTA...GYENPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
5 P14925 MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
6 P70180 MRSLLLFTFSACVLL...RELREDSIRSHFSVA 1 477 499 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER

Select a protein by its entry (accession) via the sample parameter, passing both df_seq and df_parts. The returned dictionary contains the jmd_n_seq, tmd_seq, and jmd_c_seq parts:

seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample="P05067")  # APP
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)
  jmd_n_seq tmd_seq jmd_c_seq
1 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
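As a quick manual sanity check, the returned parts can be compared with the corresponding df_seq row. The sketch below hard-codes the values shown above for P05067 and uses a hypothetical helper, parts_match. It assumes tmd_start and tmd_stop are 1-based and inclusive (which agrees with the 23-residue TMD displayed), and it is not the library's internal cross-check.

```python
# Values copied from the df_seq row and the get_seq_kws output shown above (P05067).
row = {
    "tmd_start": 701,
    "tmd_stop": 723,
    "jmd_n": "FAEDVGSNKG",
    "tmd": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c": "KKKQYTSIHH",
}
seq_kws = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}


def parts_match(row, seq_kws):
    """Hypothetical helper: compare returned parts with a part-based df_seq row."""
    same_parts = (
        seq_kws["jmd_n_seq"] == row["jmd_n"]
        and seq_kws["tmd_seq"] == row["tmd"]
        and seq_kws["jmd_c_seq"] == row["jmd_c"]
    )
    # Assumption: positions are 1-based and inclusive, so length = stop - start + 1.
    tmd_len_ok = len(seq_kws["tmd_seq"]) == row["tmd_stop"] - row["tmd_start"] + 1
    return same_parts and tmd_len_ok


print(parts_match(row, seq_kws))
```

A check like this only makes sense when df_parts was built with the default JMD lengths, since shorter flanks would no longer equal the jmd_n and jmd_c columns of df_seq.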

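Alternatively, a protein can be selected by its row position in df_parts. Positions are zero-based, so sample=0 selects the first row (Q14802; the displayed row labels start at 1):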

seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample=0)
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)
  jmd_n_seq tmd_seq jmd_c_seq
1 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
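The two selection modes can be pictured with a small, library-independent sketch. resolve_sample is a hypothetical helper, not part of aaanalysis. It assumes zero-based row positions (consistent with sample=0 returning the first row above) and rejects negative positions, which the library may handle differently.

```python
def resolve_sample(entries, sample):
    """Hypothetical helper: map a row position (int) or entry name (str) to an entry name."""
    if isinstance(sample, bool):
        # bool is a subclass of int in Python; treat it as invalid input here.
        raise TypeError("sample must be int or str, not bool")
    if isinstance(sample, int):
        # Assumption: zero-based position, no negative indexing.
        if not 0 <= sample < len(entries):
            raise IndexError(f"row position {sample} is out of range")
        return entries[sample]
    if isinstance(sample, str):
        if sample not in entries:
            raise KeyError(f"entry {sample!r} not found")
        return sample
    raise TypeError("sample must be int or str")


# Entry order as displayed for the DOM_GSEC subset above.
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]

print(resolve_sample(entries, 0))
print(resolve_sample(entries, "P05067"))
```

Selecting by entry name is the more robust choice when df_parts may be reordered or filtered, because a row position silently points at a different protein after such changes.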

There is no JMD-length argument: the flanking JMD lengths are determined by df_parts, so the residues always match the features. To change the flanks, build df_parts with different jmd_n_len / jmd_c_len values (the same parts you would then pass to CPP.run()):

df_parts_5 = sf.get_df_parts(df_seq=df_seq, jmd_n_len=5, jmd_c_len=5)
seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts_5, sample="P05067")
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)
  jmd_n_seq tmd_seq jmd_c_seq
1 GSNKG AIIGLMVGGVVIATVIVITLVML KKKQY
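The effect of shorter flanks can be reproduced with plain string slicing on the 10-residue parts shown earlier. trim_flanks is a hypothetical helper for illustration only; it assumes the retained JMD residues are those adjacent to the TMD, which matches the output above.

```python
def trim_flanks(jmd_n_seq, tmd_seq, jmd_c_seq, jmd_n_len, jmd_c_len):
    """Hypothetical helper: keep only the JMD residues closest to the TMD."""
    # s[-0:] would return the whole string, so a length of 0 is handled explicitly.
    jmd_n = jmd_n_seq[-jmd_n_len:] if jmd_n_len > 0 else ""
    jmd_c = jmd_c_seq[:jmd_c_len]
    return {"jmd_n_seq": jmd_n, "tmd_seq": tmd_seq, "jmd_c_seq": jmd_c}


# 10-residue flanks of P05067 as returned with the default df_parts.
full = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}

trimmed = trim_flanks(**full, jmd_n_len=5, jmd_c_len=5)
print(trimmed["jmd_n_seq"], trimmed["tmd_seq"], trimmed["jmd_c_seq"])
```

The TMD is unchanged by the flank lengths; only the JMD-N suffix and the JMD-C prefix are shortened.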

The returned dictionary is designed to be unpacked directly into the sample-level plotting methods via **seq_kws, so the sequence parts no longer have to be sliced out of df_parts by hand:

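# Assumes cpp_plot is a CPPPlot instance and df_feat is the feature DataFrame from CPP.run(),
# extended with a per-sample impact column for APP (here feat_impact_APP)
cpp_plot.profile(df_feat=df_feat, col_imp="feat_impact_APP", shap_plot=True, **seq_kws)
cpp_plot.feature_map(df_feat=df_feat, name_test="APP", shap_plot=True, **seq_kws)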
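How the unpacking works can be seen without any plotting backend. plot_stub below is a hypothetical stand-in, not a CPPPlot method; it only assumes that the receiving function accepts jmd_n_seq, tmd_seq, and jmd_c_seq as keyword arguments.

```python
def plot_stub(df_feat=None, jmd_n_seq="", tmd_seq="", jmd_c_seq="", **kwargs):
    """Hypothetical stand-in for a plotting method that takes the three sequence parts."""
    # A real plot would label its x-axis with these residues; here they are just joined.
    return jmd_n_seq + tmd_seq + jmd_c_seq


seq_kws = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}

# **seq_kws expands to jmd_n_seq=..., tmd_seq=..., jmd_c_seq=...
unpacked = plot_stub(df_feat=None, shap_plot=True, **seq_kws)
explicit = plot_stub(
    df_feat=None,
    shap_plot=True,
    jmd_n_seq=seq_kws["jmd_n_seq"],
    tmd_seq=seq_kws["tmd_seq"],
    jmd_c_seq=seq_kws["jmd_c_seq"],
)
print(unpacked == explicit)
```

Because the dictionary keys match the parameter names exactly, the unpacked call and the explicit call are equivalent; a misspelled key would fall through to **kwargs in this stub instead of setting a sequence part.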