SequenceFeature.get_seq_kws
SequenceFeature.get_seq_kws(df_seq=None, df_parts=None, sample=None)
Get the per-part sequence keyword arguments (`jmd_n_seq`, `tmd_seq`, `jmd_c_seq`) for one protein.

Returns the TMD-JMD sequence parts of a single protein as a ready-to-use `seq_kws` dictionary, so the per-protein sequence can be passed directly to the `CPPPlot` methods (e.g. for sample-level plots) via `**seq_kws`, without manually slicing the DataFrame. The parts are taken from `df_parts` (the same sequence parts that produced `df_feat` via `CPP.run()`), so the displayed residues are always bound to the feature geometry; `df_seq` is cross-checked for consistency. The JMD lengths are read off `df_parts` (no length argument); a JMD that the parts do not contain is returned empty.

Parameters:
- df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an `entry` column with unique protein identifiers and sequence information in a distinct format: Position-based, Part-based, Sequence-based, or Sequence-TMD-based.
- df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame (indexed by `entry`) as produced by `SequenceFeature.get_df_parts()` and passed to `CPP.run()`. Defines the TMD-JMD geometry; must be consistent with `df_seq`.
- sample (int or str) – The protein to extract, given either as a row position in `df_parts` or as an entry name (str) from its index.
Returns:
seq_kws – Dictionary with the keys `jmd_n_seq`, `tmd_seq`, and `jmd_c_seq` mapping to the corresponding amino acid sequence parts of the selected protein (empty string where a JMD part is not encoded in `df_parts`).

Return type:
dict
See also
- `SequenceFeature.get_df_parts()` for creating the underlying sequence parts DataFrame.
- `CPPPlot.profile()` and `CPPPlot.feature_map()`, which accept the returned parts via `**seq_kws`.
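The selection rules described above (`sample` as a row position or an entry name, and an empty string for a JMD that the parts do not contain) can be sketched in plain Python. This is an illustrative toy, not the library's implementation: `toy_get_seq_kws`, `TOY_PARTS`, the `jmd_n`/`tmd`/`jmd_c` keys, and the `TMD_ONLY` entry are hypothetical names, a dictionary stands in for `df_parts`, and the consistency check against `df_seq` is left out.

```python
from typing import Dict, Union

# Toy stand-in for df_parts: entry -> sequence parts.
# Sequences are copied from the DOM_GSEC example table; the key names and
# the "TMD_ONLY" entry (a row without JMD parts) are assumptions of this sketch.
TOY_PARTS: Dict[str, Dict[str, str]] = {
    "Q14802": {"jmd_n": "NSPFYYDWHS", "tmd": "LQVGGLICAGVLCAMGIIIVMSA", "jmd_c": "KCKCKFGQKS"},
    "P05067": {"jmd_n": "FAEDVGSNKG", "tmd": "AIIGLMVGGVVIATVIVITLVML", "jmd_c": "KKKQYTSIHH"},
    "TMD_ONLY": {"tmd": "FVQIIIIVVVMMVMVVVITCLLS"},
}


def toy_get_seq_kws(parts: Dict[str, Dict[str, str]], sample: Union[int, str]) -> Dict[str, str]:
    """Return the three sequence parts of one entry as a keyword dictionary."""
    entries = list(parts)  # insertion order plays the role of the row order
    if isinstance(sample, bool) or not isinstance(sample, (int, str)):
        raise TypeError("'sample' should be an int (row position) or a str (entry name)")
    if isinstance(sample, int):
        if not 0 <= sample < len(entries):
            raise IndexError(f"row position {sample} is out of range")
        entry = entries[sample]
    else:
        if sample not in parts:
            raise KeyError(f"entry '{sample}' not found")
        entry = sample
    row = parts[entry]
    # A JMD part that is not encoded for this entry becomes an empty string
    return {
        "jmd_n_seq": row.get("jmd_n", ""),
        "tmd_seq": row["tmd"],
        "jmd_c_seq": row.get("jmd_c", ""),
    }


print(toy_get_seq_kws(TOY_PARTS, "P05067"))    # select by entry name
print(toy_get_seq_kws(TOY_PARTS, 0))           # select by row position
print(toy_get_seq_kws(TOY_PARTS, "TMD_ONLY"))  # both JMD values are empty strings
```

Both selection modes resolve to the same kind of dictionary, so downstream code does not need to know how the protein was chosen.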
Examples
The `SequenceFeature().get_seq_kws()` method returns the JMD-N, TMD, and JMD-C sequence parts of a single protein as a ready-to-use `seq_kws` dictionary. The parts are taken from `df_parts` (the same parts passed to `CPP.run()`), so the displayed residues stay bound to the feature geometry; `df_seq` is cross-checked for consistency. This is the per-protein sequence input for sample-level plots such as `CPPPlot.profile()` and `CPPPlot.feature_map()` (e.g. for SHAP-based explanations), passed via `**seq_kws`.

We first load the DOM_GSEC example dataset and build its sequence parts:

```python
import pandas as pd
import aaanalysis as aa

aa.options["verbose"] = False  # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df_seq, n_rows=10, show_shape=True)
```
DataFrame shape: (6, 8)

| | entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c |
|---|---|---|---|---|---|---|---|---|
| 1 | Q14802 | MQKVTLGLLVFLAGF...PGETPPLITPGSAQS | 0 | 37 | 59 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS |
| 2 | Q86UE4 | MAARSWQDELAQQAE...SPKQIKKKKKARRET | 0 | 50 | 72 | LGLEPKRYPG | WVILVGTGALGLLLLFLLGYGWA | AACAGARKKR |
| 3 | Q969W9 | MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL | 0 | 41 | 63 | FQSMEITELE | FVQIIIIVVVMMVMVVVITCLLS | HYKLSARSFI |
| 4 | P05067 | MLPGLALLLLAAWTA...GYENPTYKFFEQMQN | 1 | 701 | 723 | FAEDVGSNKG | AIIGLMVGGVVIATVIVITLVML | KKKQYTSIHH |
| 5 | P14925 | MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS | 1 | 868 | 890 | KLSTEPGSGV | SVVLITTLLVIPVLVLLAIVMFI | RWKKSRAFGD |
| 6 | P70180 | MRSLLLFTFSACVLL...RELREDSIRSHFSVA | 1 | 477 | 499 | PCKSSGGLEE | SAVTGIVVGALLGAGLLMAFYFF | RKKYRITIER |

Select a protein by its `entry` (accession) via the `sample` parameter, providing `df_seq` and `df_parts`. The returned dictionary carries the `jmd_n_seq`, `tmd_seq`, and `jmd_c_seq` parts:

```python
seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample="P05067")  # APP
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)
```
| | jmd_n_seq | tmd_seq | jmd_c_seq |
|---|---|---|---|
| 1 | FAEDVGSNKG | AIIGLMVGGVVIATVIVITLVML | KKKQYTSIHH |

Alternatively, a protein can be selected by its row position in `df_parts`:

```python
seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample=0)
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)
```
| | jmd_n_seq | tmd_seq | jmd_c_seq |
|---|---|---|---|
| 1 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS |

There is no JMD-length argument: the flanking JMD lengths are read off `df_parts`, so the residues always match the features. To change the flanks, build `df_parts` with different `jmd_n_len`/`jmd_c_len` (the same parts you would then pass to `CPP.run()`):

```python
df_parts_5 = sf.get_df_parts(df_seq=df_seq, jmd_n_len=5, jmd_c_len=5)
seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts_5, sample="P05067")
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)
```
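| | jmd_n_seq | tmd_seq | jmd_c_seq |
|---|---|---|---|
| 1 | GSNKG | AIIGLMVGGVVIATVIVITLVML | KKKQY |

The returned dictionary is designed to be splatted directly into the sample-level plotting methods, so the output of `get_df_parts` no longer has to be sliced by hand:

```python
# Assumes cpp_plot (a CPPPlot instance) and df_feat (from CPP.run(), containing
# the column "feat_impact_APP") were created beforehand; they are not defined here.
cpp_plot.profile(df_feat=df_feat, col_imp="feat_impact_APP", shap_plot=True, **seq_kws)
cpp_plot.feature_map(df_feat=df_feat, name_test="APP", shap_plot=True, **seq_kws)
```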
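The `**seq_kws` call pattern above relies only on Python keyword unpacking, which the following minimal sketch illustrates without any plotting. `toy_profile` is a hypothetical stand-in for a method that accepts the three sequence parts as keyword arguments; it is not part of `CPPPlot`, and the dictionaries are typed in by hand from the APP (P05067) outputs shown in the example rather than computed.

```python
from typing import Dict, Tuple, Union


def toy_profile(jmd_n_seq: str = "", tmd_seq: str = "", jmd_c_seq: str = "") -> Dict[str, Union[str, int, Tuple[int, int]]]:
    """Hypothetical consumer of the three parts (stand-in for a plot method)."""
    sequence = jmd_n_seq + tmd_seq + jmd_c_seq
    tmd_start = len(jmd_n_seq)            # 0-based offset of the first TMD residue
    tmd_stop = tmd_start + len(tmd_seq)   # exclusive end of the TMD
    return {"sequence": sequence, "tmd_span": (tmd_start, tmd_stop), "n_residues": len(sequence)}


# Parts of APP with 10-residue flanks, as displayed in the example output
seq_kws = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}

# Unpacking the dictionary is equivalent to spelling out each keyword
splatted = toy_profile(**seq_kws)
explicit = toy_profile(
    jmd_n_seq=seq_kws["jmd_n_seq"],
    tmd_seq=seq_kws["tmd_seq"],
    jmd_c_seq=seq_kws["jmd_c_seq"],
)
assert splatted == explicit
print(splatted["tmd_span"], splatted["n_residues"])  # (10, 33) 43

# Parts of APP with 5-residue flanks: the TMD offset follows the flank length
seq_kws_5 = {"jmd_n_seq": "GSNKG", "tmd_seq": "AIIGLMVGGVVIATVIVITLVML", "jmd_c_seq": "KKKQY"}
print(toy_profile(**seq_kws_5)["tmd_span"])  # (5, 28)
```

Because the keys of the dictionary match the keyword names of the consumer, the same dictionary can be reused across several calls, which is the convenience the `seq_kws` return value is meant to provide.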