SequenceFeature.get_seq_kws

SequenceFeature.get_seq_kws(df_seq, df_parts, sample)

Get the per-part sequence keyword arguments (jmd_n_seq, tmd_seq, jmd_c_seq) for one protein.

Returns the TMD-JMD sequence parts of a single protein as a ready-to-use seq_kws dictionary. The per-protein sequence can then be passed directly to the CPPPlot methods (e.g. for sample-level plots) via **seq_kws, without manually slicing the DataFrame. The parts are taken from df_parts (the same sequence parts that produced df_feat via CPP.run()), so the displayed residues always match the feature geometry; df_seq is cross-checked for consistency. There is no JMD-length argument: the JMD lengths are inferred from df_parts, and a JMD that the parts do not contain is returned as an empty string.
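For orientation, the manual slicing that this method makes unnecessary can be sketched as follows. This is a minimal illustration, not the library's implementation: slice_parts is a hypothetical helper, the sequence is a toy string, and the sketch assumes that tmd_start and tmd_stop are 1-based and inclusive (consistent with the example table in the Examples section, where tmd_stop - tmd_start + 1 equals the TMD length).

```python
def slice_parts(sequence, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Hypothetical helper: split a sequence into JMD-N, TMD, and JMD-C parts.

    Assumes tmd_start and tmd_stop are 1-based, inclusive positions.
    """
    start = tmd_start - 1  # 1-based inclusive start -> 0-based index
    stop = tmd_stop        # 1-based inclusive stop == 0-based exclusive stop
    return {
        "jmd_n_seq": sequence[max(0, start - jmd_n_len):start],
        "tmd_seq": sequence[start:stop],
        "jmd_c_seq": sequence[stop:stop + jmd_c_len],
    }


# Toy sequence: positions 1-5 are 'A', 6-13 are 'L', 14-18 are 'K'
toy = "AAAAA" + "LLLLLLLL" + "KKKKK"
parts = slice_parts(toy, tmd_start=6, tmd_stop=13, jmd_n_len=3, jmd_c_len=3)
no_flanks = slice_parts(toy, tmd_start=6, tmd_stop=13, jmd_n_len=0, jmd_c_len=0)
print(parts)
print(no_flanks)
```

With flank lengths of zero, both JMD entries come back as empty strings, which mirrors the empty-JMD behavior described above.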

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and sequence information in a distinct format: Position-based, Part-based, Sequence-based, or Sequence-TMD-based.
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame (indexed by entry) as produced by SequenceFeature.get_df_parts() and passed to CPP.run(). Defines the TMD-JMD geometry; must be consistent with df_seq.
sample (int or str) – The protein to extract, given either as a row position in df_parts or as an entry name (str) from its index.
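The two ways of specifying sample can be pictured with the sketch below. resolve_sample is a hypothetical helper written for illustration only; how the library validates sample internally is an assumption here. The entry list mirrors the row order of the example dataset used in the Examples section.

```python
def resolve_sample(entries, sample):
    """Hypothetical helper: map a row position (int) or entry name (str) to an entry."""
    if isinstance(sample, bool):
        # bool is a subclass of int, so reject it explicitly
        raise TypeError("sample must be int or str, not bool")
    if isinstance(sample, int):
        if not 0 <= sample < len(entries):
            raise IndexError(f"row position {sample} is out of range")
        return entries[sample]
    if isinstance(sample, str):
        if sample not in entries:
            raise KeyError(f"entry {sample!r} not found")
        return sample
    raise TypeError("sample must be int or str")


# Entry names in the row order of the example dataset
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
by_position = resolve_sample(entries, 0)
by_name = resolve_sample(entries, "P05067")
print(by_position, by_name)
```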

Returns:

seq_kws – Dictionary with the keys jmd_n_seq, tmd_seq, and jmd_c_seq mapping to the corresponding amino acid sequence parts of the selected protein (empty string where a JMD part is not encoded in df_parts).

Return type:

dict

See also

SequenceFeature.get_df_parts() for creating the underlying sequence parts DataFrame.
CPPPlot.profile() and CPPPlot.feature_map(), which accept the returned parts via **seq_kws.

Examples

The SequenceFeature().get_seq_kws() method returns the JMD-N, TMD, and JMD-C sequence parts of a single protein as a ready-to-use seq_kws dictionary. The parts are taken from df_parts (the same parts passed to CPP.run()), so the displayed residues always match the feature geometry; df_seq is cross-checked for consistency. The dictionary is the per-protein sequence input for sample-level plots such as CPPPlot.profile() and CPPPlot.feature_map() (e.g. for SHAP-based explanations) and is passed via **seq_kws.

We first load the DOM_GSEC example dataset and build its sequence parts:

import pandas as pd
import aaanalysis as aa
aa.options["verbose"] = False  # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df_seq, n_rows=10, show_shape=True)

DataFrame shape: (6, 8)

	entry	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	0	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	0	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
5	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
6	P70180	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	477	499	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER

Select a protein by its entry (accession) via the sample parameter, passing df_seq and df_parts alongside it. The returned dictionary contains the jmd_n_seq, tmd_seq, and jmd_c_seq parts:

seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample="P05067")  # APP
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)

	jmd_n_seq	tmd_seq	jmd_c_seq
1	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
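As a rough picture of what agreement between df_seq and the returned parts means for this protein, the sketch below compares the output above with the P05067 row of the df_seq table. is_consistent is a hypothetical check written for illustration; the exact cross-check performed by the library is not shown on this page. The 1-based, inclusive reading of tmd_start and tmd_stop is an assumption inferred from the table.

```python
# Values copied from the P05067 row of the df_seq table
row = {
    "tmd_start": 701,
    "tmd_stop": 723,
    "jmd_n": "FAEDVGSNKG",
    "tmd": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c": "KKKQYTSIHH",
}

# Values copied from the get_seq_kws output for sample="P05067"
seq_kws = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}


def is_consistent(row, seq_kws):
    """Hypothetical check: returned parts agree with the df_seq columns."""
    same_parts = all(
        row[col] == seq_kws[f"{col}_seq"] for col in ("jmd_n", "tmd", "jmd_c")
    )
    # Assumes tmd_start and tmd_stop are 1-based and inclusive
    tmd_len_ok = len(seq_kws["tmd_seq"]) == row["tmd_stop"] - row["tmd_start"] + 1
    return same_parts and tmd_len_ok


consistent = is_consistent(row, seq_kws)
print(consistent)
```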

Alternatively, a protein can be selected by its row position in df_parts:

seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample=0)
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)

	jmd_n_seq	tmd_seq	jmd_c_seq
1	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS

There is no JMD-length argument: the flanking JMD lengths are inferred from df_parts, so the residues always match the features. To change the flanks, build df_parts with different jmd_n_len / jmd_c_len values (the same parts you would then pass to CPP.run()):

df_parts_5 = sf.get_df_parts(df_seq=df_seq, jmd_n_len=5, jmd_c_len=5)
seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts_5, sample="P05067")
aa.display_df(pd.DataFrame([seq_kws]), char_limit=50)

	jmd_n_seq	tmd_seq	jmd_c_seq
1	GSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQY
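Comparing this output with the default 10-residue flanks shown earlier suggests that the shorter JMDs are the residues directly adjacent to the TMD: the last five of JMD-N and the first five of JMD-C. The sketch below reproduces that relationship on the displayed strings; trim_flanks is a hypothetical helper, not part of the library.

```python
def trim_flanks(seq_kws, jmd_n_len, jmd_c_len):
    """Hypothetical helper: keep only the TMD-adjacent residues of each JMD."""
    jmd_n = seq_kws["jmd_n_seq"]
    jmd_c = seq_kws["jmd_c_seq"]
    return {
        # JMD-N borders the TMD at its C-terminal end, so keep its last residues
        "jmd_n_seq": jmd_n[max(0, len(jmd_n) - jmd_n_len):],
        "tmd_seq": seq_kws["tmd_seq"],
        # JMD-C borders the TMD at its N-terminal end, so keep its first residues
        "jmd_c_seq": jmd_c[:jmd_c_len],
    }


# Output for P05067 with the default flanks (10 residues each)
full = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}
trimmed = trim_flanks(full, jmd_n_len=5, jmd_c_len=5)
print(trimmed)
```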

The returned dictionary is designed to be unpacked directly into the sample-level plotting methods, so no manual slicing of df_parts is needed:

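# Assumes cpp_plot (a CPPPlot instance) and df_feat (including the
# feat_impact_APP column) were created beforehand.
cpp_plot.profile(df_feat=df_feat, col_imp="feat_impact_APP", shap_plot=True, **seq_kws)
cpp_plot.feature_map(df_feat=df_feat, name_test="APP", shap_plot=True, **seq_kws)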
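The ** unpacking used in these calls is plain Python keyword-argument expansion. The sketch below shows the equivalence with a stand-in function; plot_stub is hypothetical and only mimics a method that accepts the three sequence parts as keyword arguments. It is not the CPPPlot API.

```python
def plot_stub(df_feat=None, tmd_seq=None, jmd_n_seq=None, jmd_c_seq=None, **kwargs):
    """Hypothetical stand-in for a plotting method taking the three sequence parts."""
    return f"{jmd_n_seq}|{tmd_seq}|{jmd_c_seq}"


seq_kws = {
    "jmd_n_seq": "FAEDVGSNKG",
    "tmd_seq": "AIIGLMVGGVVIATVIVITLVML",
    "jmd_c_seq": "KKKQYTSIHH",
}

# Unpacking the dictionary ...
unpacked = plot_stub(df_feat=None, **seq_kws)

# ... is equivalent to passing each part explicitly
explicit = plot_stub(
    df_feat=None,
    jmd_n_seq=seq_kws["jmd_n_seq"],
    tmd_seq=seq_kws["tmd_seq"],
    jmd_c_seq=seq_kws["jmd_c_seq"],
)
print(unpacked == explicit)
```

Because the dictionary keys match the parameter names, the same seq_kws object can be reused across several calls without repeating the three arguments.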