SequenceFeature.get_df_parts

SequenceFeature.get_df_parts(df_seq, list_parts=None, all_parts=False, jmd_n_len=10, jmd_c_len=10, tmd_len=None, remove_entries_with_gaps=False, replace_non_canonical_aa=False)

Create DataFrame with selected sequence parts.

Slices each protein sequence in df_seq into the requested Parts (target middle domain (TMD), JMD-N, JMD-C, and combinations thereof) using the boundary information supplied with the sequences. The resulting df_parts DataFrame is the primary sequence input for CPP and for SequenceFeature.feature_matrix().

Added in version 0.1.0.

Changed in version 1.1.0: Added the pos-anchor input mode (tmd_len).

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and sequence information in one of the following formats: Position-based, Part-based, Sequence-based, Sequence-TMD-based, or Anchor-based (see Notes).
list_parts (list of str, default={tmd, jmd_n_tmd_n, tmd_c_jmd_c}) – Names of sequence parts that should be obtained for sequences from df_seq.
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids. If None, jmd_n and jmd_c should be given.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids. If None, jmd_n and jmd_c should be given.
tmd_len (int, optional) – TMD length in amino acids, used for the Anchor-based format only (df_seq with ‘sequence’ and ‘pos’ columns). Each 1-based anchor in pos is placed at the P1 position of a TMD of length tmd_len (right-heavy for even tmd_len). Ignored for the other formats.
all_parts (bool, default=False) – Whether to create DataFrame with all possible sequence parts (if True) or parts given by list_parts.
remove_entries_with_gaps (bool, default=False) – Whether to exclude entries containing missing residues in their sequence parts (if True), usually resulting from sequences being too short.
replace_non_canonical_aa (bool, default=False) – Whether to replace non-canonical amino acids (e.g., ‘X’) by gap (‘-’) symbol.

Returns:

df_parts – Sequence parts DataFrame.

Return type:

pd.DataFrame, shape (n_samples, n_parts)

See also

aaanalysis.SequenceFeature for definition of parts, and lists of all existing and default parts.

Notes

If ext_len in aaanalysis.options is not set to a value > 0, the following parts containing the extended TMD are not considered for all_parts=True: [‘tmd_e’, ‘ext_c’, ‘ext_n’, ‘ext_n_tmd_n’, ‘tmd_c_ext_c’].
jmd_n_len and jmd_c_len must both be given, except for the part-based format.
tmd_start and tmd_stop use 1-based indexing to follow standard biological annotation conventions (e.g., UniProt), where residue positions start at 1. This allows direct use of annotated positions without conversion.
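
The 1-based, inclusive convention can be illustrated with plain Python slicing. The following is a minimal sketch, not the library's implementation: the helper name slice_parts is hypothetical, and it simply truncates at the sequence ends rather than reproducing any gap handling that get_df_parts may apply to short sequences.

```python
def slice_parts(sequence, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Slice one sequence into (jmd_n, tmd, jmd_c) from 1-based, inclusive TMD bounds.

    Hypothetical helper for illustration only (not part of aaanalysis).
    """
    start0 = tmd_start - 1   # 1-based inclusive start -> 0-based start
    stop0 = tmd_stop         # 1-based inclusive stop  -> 0-based exclusive stop
    tmd = sequence[start0:stop0]
    # Flanking regions are truncated if the sequence is too short
    jmd_n = sequence[max(0, start0 - jmd_n_len):start0]
    jmd_c = sequence[stop0:stop0 + jmd_c_len]
    return jmd_n, tmd, jmd_c


# First 15 residues of P05067 as displayed in the Examples section
seq = "MLPGLALLLLAAWTA"
print(slice_parts(seq, tmd_start=1, tmd_stop=5, jmd_n_len=0, jmd_c_len=3))
# -> ('', 'MLPGL', 'ALL')
```

With tmd_start=1 and tmd_stop=5, the TMD is the first five residues, which matches the position-based example in the Examples section.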

Formats for df_seq are differentiated by their respective columns:

Position-based format

‘sequence’: The complete amino acid sequence.
‘tmd_start’: Starting position of the TMD in the sequence (1-based, inclusive).
‘tmd_stop’: Ending position of the TMD in the sequence (1-based, inclusive).

Part-based format

‘jmd_n’: Amino acid sequence for JMD-N.
‘tmd’: Amino acid sequence for TMD.
‘jmd_c’: Amino acid sequence for JMD-C.

Sequence-TMD-based format

‘sequence’ and ‘tmd’ columns.

Sequence-based format

Only the ‘sequence’ column.

Anchor-based format

‘sequence’ and ‘pos’ columns, together with the tmd_len argument.
‘pos’: per-row 1-based P1 anchor position(s), given as a single int or a list[int]. Each anchor is exploded into one row whose TMD is centered on the anchor (right-heavy for even tmd_len); multi-anchor rows yield multiple rows, identified in the index by <entry>_<win_start>-<win_stop>.
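
One plausible reading of the anchor-based windowing is sketched below. This is an assumption-laden illustration, not the library's code: the helper names anchor_window and explode_anchors are hypothetical, "right-heavy" is interpreted as placing the extra residue on the C-terminal side of the anchor when tmd_len is even, and the <entry>_<win_start>-<win_stop> identifier is applied to every anchor, including single-anchor rows.

```python
def anchor_window(pos, tmd_len):
    """Return (win_start, win_stop), 1-based inclusive, around the anchor `pos`.

    Assumption: for even tmd_len the window holds one more residue
    after the anchor than before it ("right-heavy").
    """
    left = (tmd_len - 1) // 2   # residues before the anchor
    right = tmd_len // 2        # residues after the anchor
    return pos - left, pos + right


def explode_anchors(entry, sequence, pos, tmd_len):
    """Map one df_seq row to {row_id: tmd}, one item per anchor (hypothetical helper)."""
    anchors = [pos] if isinstance(pos, int) else list(pos)
    rows = {}
    for p in anchors:
        win_start, win_stop = anchor_window(p, tmd_len)
        row_id = f"{entry}_{win_start}-{win_stop}"
        # Convert the 1-based inclusive window to a Python slice
        rows[row_id] = sequence[max(0, win_start - 1):win_stop]
    return rows


print(explode_anchors("ENTRY", "ABCDEFGHIJKLMNOP", pos=[5, 10], tmd_len=4))
# -> {'ENTRY_4-7': 'DEFG', 'ENTRY_9-12': 'IJKL'}
```

In this sketch, the anchor at position 5 (residue E) gets one residue before it and two after it for tmd_len=4; an odd tmd_len gives a symmetric window.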

Examples

To demonstrate the SequenceFeature().get_df_parts() method, we first load an example sequence dataset using the load_dataset() function:

import aaanalysis as aa
sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="SEQ_CAPSID")
aa.display_df(df_seq, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (7862, 3)

	entry	sequence
1	CAPSID_1	MVTHNVK...KLDEENV
2	CAPSID_2	MKKRQKK...ARHFGEE
3	CAPSID_3	MRYGGSV...QEVELVD

By default, three sequence parts (tmd, jmd_n_tmd_n, tmd_c_jmd_c) are returned, with jmd_n and jmd_c lengths of 10 residues each:

df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df=df_parts, n_rows=5, show_shape=True, char_limit=15)

DataFrame shape: (7862, 3)

	tmd	jmd_n_tmd_n	tmd_c_jmd_c
entry
CAPSID_1	HVTRRSY...DDDTPRI	MVTHNVK...FTNPTVT	ARHGDNN...KLDEENV
CAPSID_2	SNFTDTS...RMAMLEA	MKKRQKK...MTIHEEF	GHFDGLS...ARHFGEE
CAPSID_3	ELEVSLH...KSPGSGA	MRYGGSV...YSAAALS	AAAELSA...QEVELVD
CAPSID_4	VGRHRRI...KRRQALE	MERGDIP...KAEDVSK	YQRIRDE...EMDAGLI
CAPSID_5	LILFTQV...QMPSGCV	MKRIYLL...EMIKSTP	YGNTITL...ITELTHQ

Any combination of valid sequence parts can be obtained using the list_parts parameter:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'])
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (7862, 4)

	jmd_n	tmd	jmd_c	tmd_jmd
entry
CAPSID_1	MVTHNVKINK	HVTRRSY...DDDTPRI	PATKLDEENV	MVTHNVK...KLDEENV
CAPSID_2	MKKRQKKMTL	SNFTDTS...RMAMLEA	VINARHFGEE	MKKRQKK...ARHFGEE
CAPSID_3	MRYGGSVISQ	ELEVSLH...KSPGSGA	PDKQEVELVD	MRYGGSV...QEVELVD

Set the lengths of both JMDs with the jmd_n_len and jmd_c_len parameters:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'], jmd_c_len=5, jmd_n_len=20)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (7862, 4)

	jmd_n	tmd	jmd_c	tmd_jmd
entry
CAPSID_1	MVTHNVK...RRSYSSA	KEVLEIP...RIPATKL	DEENV	MVTHNVK...KLDEENV
CAPSID_2	MKKRQKK...TDTSFQD	FVSAEQV...EAVINAR	HFGEE	MKKRQKK...ARHFGEE
CAPSID_3	MRYGGSV...VSLHMAF	VEARSAR...GAPDKQE	VELVD	MRYGGSV...QEVELVD

With a JMD length of 0, the corresponding part is empty (shown as a blank cell in the output):

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'jmd_n_tmd_n'], jmd_n_len=0)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (7862, 2)

	jmd_n	jmd_n_tmd_n
entry
CAPSID_1		MVTHNVK...DDGSFTN
CAPSID_2		MKKRQKK...MMLDLMT
CAPSID_3		MRYGGSV...EHHNVRY

To select all possible parts, set all_parts=True:

df_parts = sf.get_df_parts(df_seq=df_seq, all_parts=True)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (7862, 8)

	tmd	tmd_n	tmd_c	jmd_n	jmd_c	tmd_jmd	jmd_n_tmd_n	tmd_c_jmd_c
entry
CAPSID_1	HVTRRSY...DDDTPRI	HVTRRSY...FTNPTVT	ARHGDNN...DDDTPRI	MVTHNVKINK	PATKLDEENV	MVTHNVK...KLDEENV	MVTHNVK...FTNPTVT	ARHGDNN...KLDEENV
CAPSID_2	SNFTDTS...RMAMLEA	SNFTDTS...MTIHEEF	GHFDGLS...RMAMLEA	MKKRQKKMTL	VINARHFGEE	MKKRQKK...ARHFGEE	MKKRQKK...MTIHEEF	GHFDGLS...ARHFGEE
CAPSID_3	ELEVSLH...KSPGSGA	ELEVSLH...YSAAALS	AAAELSA...KSPGSGA	MRYGGSVISQ	PDKQEVELVD	MRYGGSV...QEVELVD	MRYGGSV...YSAAALS	AAAELSA...QEVELVD
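
The table above suggests how the combined parts relate to the three basic parts: tmd_n and tmd_c are the N- and C-terminal halves of the TMD, and the remaining parts are concatenations. A minimal sketch of that relationship follows; the helper compose_parts is hypothetical, and the handling of odd TMD lengths (which half receives the middle residue) is an assumption not confirmed by this page, so the example uses an even-length toy TMD.

```python
def compose_parts(jmd_n, tmd, jmd_c):
    """Build the eight default parts from jmd_n, tmd, and jmd_c (illustrative only)."""
    # Assumption: for odd TMD lengths, tmd_c receives the extra residue
    half = len(tmd) // 2
    tmd_n, tmd_c = tmd[:half], tmd[half:]
    return {
        "tmd": tmd,
        "tmd_n": tmd_n,
        "tmd_c": tmd_c,
        "jmd_n": jmd_n,
        "jmd_c": jmd_c,
        "tmd_jmd": jmd_n + tmd + jmd_c,
        "jmd_n_tmd_n": jmd_n + tmd_n,
        "tmd_c_jmd_c": tmd_c + jmd_c,
    }


parts = compose_parts(jmd_n="AAAA", tmd="CCCCDDDD", jmd_c="EEEE")
print(parts["jmd_n_tmd_n"], parts["tmd_c_jmd_c"], parts["tmd_jmd"])
# -> AAAACCCC DDDDEEEE AAAACCCCDDDDEEEE
```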

Entries with sequence gaps can be removed by setting remove_entries_with_gaps=True:

n_total = len(df_parts)
# Disable warning that entries have been removed and re-instantiate SequenceFeature
aa.options["verbose"] = False
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq, remove_entries_with_gaps=True)
n_removed = n_total - len(df_parts)
print(f"{n_removed} sequences with gaps were removed")

5 sequences with gaps were removed

SequenceFeature().get_df_parts() accepts several df_seq formats. Four of them are demonstrated below using the DOM_GSEC dataset of domain-level γ-secretase substrates (see [Breimann25]). In addition to the common ‘entry’, ‘sequence’, and ‘label’ columns, this dataset provides columns for the TMD start and stop positions (‘tmd_start’, ‘tmd_stop’) and the default sequence parts ‘jmd_n’, ‘tmd’, and ‘jmd_c’:

df_seq = aa.load_dataset(name="DOM_GSEC")
aa.display_df(df_seq, n_rows=5)

	entry	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
2	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
3	P70180	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	477	499	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER
4	Q03157	MGPTSPAARGQGRRW...HGYENPTYRFLEERP	1	585	607	APSGTGVSRE	ALSGLLIMGAGGGSLIVLSLLLL	RKKKPYGTIS
5	Q06481	MAATGTAAAAATGRL...GYENPTYKYLEQMQI	1	694	716	LREDFSLSSS	ALIGLLVIAVAIATVIVISLVML	RKRQYGTISH

Position-based format:

cols_position_based = ["entry", "sequence", "tmd_start", "tmd_stop"]
df_seq_position_based = df_seq[cols_position_based].copy()  # copy, so later column assignments do not act on a view
list_parts = ["jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq_position_based, list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

	jmd_n	tmd	jmd_c
entry
P05067	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
P14925	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
P70180	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER

# tmd_start and tmd_stop are 1-based, so sequences are extracted as in standard biological
# annotation formats (e.g., UniProt). This allows direct use of annotated positions without conversion.
df_seq_position_based["tmd_start"] = 1
df_seq_position_based["tmd_stop"] = 5
print("Input sequence dataset with adjusted tmd_start and tmd_stop positions")
aa.display_df(df_seq_position_based, n_rows=3)

print("df_parts with TMD starting at position 1 and ending at position 5")
df_parts = sf.get_df_parts(df_seq=df_seq_position_based, list_parts=["tmd"])
aa.display_df(df_parts, n_rows=3)

Input sequence dataset with adjusted tmd_start and tmd_stop positions

	entry	sequence	tmd_start	tmd_stop
1	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	5
2	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	5
3	P70180	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	5

df_parts with TMD starting at position 1 and ending at position 5

	tmd
entry
P05067	MLPGL
P14925	MAGRA
P70180	MRSLL

Part-based format:

cols_part_based = ["entry", "jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_part_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

	jmd_n	tmd	jmd_c
entry
P05067	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
P14925	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
P70180	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER

Sequence-TMD-based format:

cols_sequence_tmd_based = ["entry", "sequence", "tmd"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_tmd_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

	jmd_n	tmd	jmd_c
entry
P05067	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
P14925	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
P70180	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER
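
For the Sequence-TMD-based format, the TMD boundaries have to be recovered from the ‘sequence’ and ‘tmd’ columns. One simple way to do this is a substring search, sketched below. Whether the library locates the TMD this way is an assumption; the helper parts_from_sequence_and_tmd is hypothetical and uses the first occurrence of the TMD in the sequence.

```python
def parts_from_sequence_and_tmd(sequence, tmd, jmd_n_len=10, jmd_c_len=10):
    """Recover (jmd_n, tmd, jmd_c) by locating `tmd` inside `sequence` (illustrative only)."""
    start0 = sequence.find(tmd)   # 0-based start of the first occurrence, -1 if absent
    if start0 < 0:
        raise ValueError("tmd is not a substring of sequence")
    stop0 = start0 + len(tmd)     # 0-based exclusive stop
    jmd_n = sequence[max(0, start0 - jmd_n_len):start0]
    jmd_c = sequence[stop0:stop0 + jmd_c_len]
    return jmd_n, tmd, jmd_c


# Fragment of P05067 assembled from the jmd_n, tmd, and jmd_c columns shown above
tmd = "AIIGLMVGGVVIATVIVITLVML"
fragment = "FAEDVGSNKG" + tmd + "KKKQYTSIHH"
print(parts_from_sequence_and_tmd(fragment, tmd))
# -> ('FAEDVGSNKG', 'AIIGLMVGGVVIATVIVITLVML', 'KKKQYTSIHH')
```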

Sequence-based format:

cols_sequence_based = ["entry", "sequence"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_based], list_parts=list_parts)
# Only providing the sequence will create flanking jmd_n and jmd_c regions defined by
# their length and the tmd as the remaining middle part
aa.display_df(df_parts, n_rows=3)

	jmd_n	tmd	jmd_c
entry
P05067	MLPGLALLLL	AAWTARALEVPTDGN...ERHLSKMQQNGYENP	TYKFFEQMQN
P14925	MAGRARSGLL	LLLLGLLALQSSCLA...EKDEDDGTESEEEYS	APLPKPAPSS
P70180	MRSLLLFTFS	ACVLLARVLLAGGAS...QQEESNIGKHRELRE	DSIRSHFSVA
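
The sequence-based split shown above can be sketched as follows: the first jmd_n_len residues form jmd_n, the last jmd_c_len residues form jmd_c, and everything in between is treated as the TMD. The helper split_sequence_only is hypothetical and does not cover sequences shorter than jmd_n_len + jmd_c_len.

```python
def split_sequence_only(sequence, jmd_n_len=10, jmd_c_len=10):
    """Split a full sequence into (jmd_n, tmd, jmd_c) by flank lengths (illustrative only)."""
    # Computing the stop index explicitly keeps jmd_c_len=0 working (avoids sequence[-0:])
    stop0 = len(sequence) - jmd_c_len
    jmd_n = sequence[:jmd_n_len]
    tmd = sequence[jmd_n_len:stop0]
    jmd_c = sequence[stop0:]
    return jmd_n, tmd, jmd_c


# First 15 residues of P05067, used here as a short toy sequence
print(split_sequence_only("MLPGLALLLLAAWTA", jmd_n_len=10, jmd_c_len=3))
# -> ('MLPGLALLLL', 'AA', 'WTA')
```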

get_df_parts also accepts replace_non_canonical_aa, which replaces any non-canonical amino acid with the gap symbol so that scale lookups stay valid, and tmd_len, which sets the TMD length for the anchor-based format only. Because the input below is sequence-based (no ‘pos’ column), tmd_len is ignored here:

df_parts_clean = sf.get_df_parts(df_seq=df_seq[["entry", "sequence"]],
                                 list_parts=["tmd_jmd"], tmd_len=20,
                                 replace_non_canonical_aa=True)
aa.display_df(df_parts_clean, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (126, 1)

	tmd_jmd
entry
P05067	MLPGLAL...FFEQMQN
P14925	MAGRARS...PKPAPSS
P70180	MRSLLLF...RSHFSVA
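
The effect of replace_non_canonical_aa on a single sequence can be approximated as below. This is a sketch under stated assumptions, not the library's implementation: the helper replace_non_canonical is hypothetical, and "canonical" is taken to mean the 20 standard amino acid letters, so characters such as ‘X’, ‘B’, or ‘U’ are all mapped to the gap symbol.

```python
# The 20 standard amino acids in one-letter code
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def replace_non_canonical(sequence, gap="-"):
    """Replace every character outside the 20 standard amino acids by `gap` (illustrative only)."""
    return "".join(res if res in CANONICAL_AA else gap for res in sequence)


print(replace_non_canonical("MKXLBU"))
# -> MK-L--
```

Parts that contain the gap symbol after this step are the kind of entries that remove_entries_with_gaps=True is described as excluding.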