aaanalysis.SequenceFeature.get_df_parts

SequenceFeature.get_df_parts(df_seq=None, list_parts=None, all_parts=False, jmd_n_len=10, jmd_c_len=10, tmd_len=None, remove_entries_with_gaps=False, replace_non_canonical_aa=False)

Create DataFrame with selected sequence parts.

Changed in version 1.1.0: Added the pos-anchor input mode (tmd_len).

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and sequence information in one of the following formats: Position-based, Part-based, Sequence-based, Sequence-TMD-based, or Anchor-based.

  • list_parts (list of str, default={tmd, jmd_n_tmd_n, tmd_c_jmd_c}) – Names of sequence parts that should be obtained for sequences from df_seq.

  • jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids. If None, jmd_n and jmd_c should be given.

  • jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids. If None, jmd_n and jmd_c should be given.

  • tmd_len (int, optional) – TMD length in amino acids, used only for the Anchor-based format (df_seq with ‘sequence’ and ‘pos’ columns). Each 1-based anchor in pos is placed at the P1 position of a TMD of length tmd_len (right-heavy for even tmd_len). Ignored for the other formats.

  • all_parts (bool, default=False) – Whether to create DataFrame with all possible sequence parts (if True) or parts given by list_parts.

  • remove_entries_with_gaps (bool, default=False) – Whether to exclude entries containing missing residues in their sequence parts (if True), usually resulting from sequences being too short.

  • replace_non_canonical_aa (bool, default=False) – Whether to replace non-canonical amino acids (e.g., ‘X’) with the gap symbol (‘-’).

Returns:

df_parts – Sequence parts DataFrame.

Return type:

pd.DataFrame, shape (n_samples, n_parts)

Notes

  • If ext_len in aaanalysis.options is not set to a value > 0, the following parts containing the extended TMD are not considered for all_parts=True: [‘tmd_e’, ‘ext_c’, ‘ext_n’, ‘ext_n_tmd_n’, ‘tmd_c_ext_c’].

  • jmd_n_len and jmd_c_len must both be given, except for the part-based format.

  • tmd_start and tmd_stop use 1-based indexing to follow standard biological annotation conventions (e.g., UniProt), where residue positions start at 1. This allows direct use of annotated positions without conversion.
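
As a minimal sketch of what 1-based, inclusive positions mean in terms of Python slicing, the following helper extracts a TMD from a sequence. The name slice_tmd is hypothetical and not part of aaanalysis; it only illustrates the index arithmetic.

```python
def slice_tmd(sequence: str, tmd_start: int, tmd_stop: int) -> str:
    """Return the TMD for 1-based, inclusive start and stop positions.

    Hypothetical helper for illustration; not an aaanalysis function.
    """
    if not 1 <= tmd_start <= tmd_stop <= len(sequence):
        raise ValueError("expected 1 <= tmd_start <= tmd_stop <= len(sequence)")
    # Python slices are 0-based and end-exclusive, so only the start shifts by one.
    return sequence[tmd_start - 1:tmd_stop]


print(slice_tmd("MLPGLALLLL", tmd_start=1, tmd_stop=5))  # MLPGL
```

With tmd_start=1 and tmd_stop=5, the first five residues are returned, so annotated positions can be passed in unchanged.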

Formats for df_seq are differentiated by their respective columns:

Position-based format
  • ‘sequence’: The complete amino acid sequence.

  • ‘tmd_start’: Starting position of the TMD in the sequence (1-based, inclusive).

  • ‘tmd_stop’: Ending position of the TMD in the sequence (1-based, inclusive).

Part-based format
  • ‘jmd_n’: Amino acid sequence for JMD-N.

  • ‘tmd’: Amino acid sequence for TMD.

  • ‘jmd_c’: Amino acid sequence for JMD-C.

Sequence-TMD-based format
  • ‘sequence’ and ‘tmd’ columns.

Sequence-based format
  • Only the ‘sequence’ column.

Anchor-based format
  • ‘sequence’ and ‘pos’ columns, together with the tmd_len argument.

  • ‘pos’: per-row 1-based P1 anchor position(s), given as a single int or a list[int]. Each anchor is exploded into one row whose TMD is centered (right-heavy for even tmd_len) on the anchor; multi-anchor rows yield multiple rows, identified in the index by <entry>_<win_start>-<win_stop>.
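
The window arithmetic of the Anchor-based format can be sketched as follows. This is not the library's implementation: the function names are hypothetical, "right-heavy" is read here as placing the extra residue of an even-length window on the C-terminal side of the anchor, and single-anchor rows are assumed to keep the plain entry name. Windows running past the sequence ends are not handled.

```python
from __future__ import annotations


def anchor_window(pos: int, tmd_len: int) -> tuple[int, int]:
    """Return 1-based, inclusive (win_start, win_stop) around an anchor.

    ASSUMPTION: for even tmd_len the window has one more residue to the
    right of the anchor than to the left ("right-heavy").
    """
    n_left = (tmd_len - 1) // 2
    n_right = tmd_len - 1 - n_left
    return pos - n_left, pos + n_right


def explode_anchors(entry: str, pos: int | list[int], tmd_len: int) -> list[tuple[str, int, int]]:
    """Return one (label, win_start, win_stop) tuple per anchor."""
    anchors = [pos] if isinstance(pos, int) else list(pos)
    rows = []
    for anchor in anchors:
        start, stop = anchor_window(anchor, tmd_len)
        # Label pattern <entry>_<win_start>-<win_stop> follows the description above;
        # keeping the plain entry name for single anchors is an assumption.
        label = entry if len(anchors) == 1 else f"{entry}_{start}-{stop}"
        rows.append((label, start, stop))
    return rows


print(explode_anchors("P05067", [10, 20], tmd_len=4))
```

The resulting windows use the same 1-based, inclusive convention as tmd_start and tmd_stop.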

Examples

To demonstrate SequenceFeature().get_df_parts(), we first load an example sequence dataset using the load_dataset() function:

import aaanalysis as aa
sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="SEQ_CAPSID")
aa.display_df(df_seq, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 3)
  entry sequence label
1 CAPSID_1 MVTHNVK...KLDEENV 0
2 CAPSID_2 MKKRQKK...ARHFGEE 0
3 CAPSID_3 MRYGGSV...QEVELVD 0

By default, three sequence parts (tmd, jmd_n_tmd_n, tmd_c_jmd_c) are returned, with jmd_n and jmd_c each 10 residues long:

df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df=df_parts, n_rows=5, show_shape=True, char_limit=15)
DataFrame shape: (7862, 3)
  tmd jmd_n_tmd_n tmd_c_jmd_c
entry      
CAPSID_1 HVTRRSY...DDDTPRI MVTHNVK...FTNPTVT ARHGDNN...KLDEENV
CAPSID_2 SNFTDTS...RMAMLEA MKKRQKK...MTIHEEF GHFDGLS...ARHFGEE
CAPSID_3 ELEVSLH...KSPGSGA MRYGGSV...YSAAALS AAAELSA...QEVELVD
CAPSID_4 VGRHRRI...KRRQALE MERGDIP...KAEDVSK YQRIRDE...EMDAGLI
CAPSID_5 LILFTQV...QMPSGCV MKRIYLL...EMIKSTP YGNTITL...ITELTHQ

Any combination of valid sequence parts can be obtained using the list_parts parameter:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'])
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 4)
  jmd_n tmd jmd_c tmd_jmd
entry        
CAPSID_1 MVTHNVKINK HVTRRSY...DDDTPRI PATKLDEENV MVTHNVK...KLDEENV
CAPSID_2 MKKRQKKMTL SNFTDTS...RMAMLEA VINARHFGEE MKKRQKK...ARHFGEE
CAPSID_3 MRYGGSVISQ ELEVSLH...KSPGSGA PDKQEVELVD MRYGGSV...QEVELVD

Set the lengths of the two JMDs using the jmd_n_len and jmd_c_len parameters:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'], jmd_c_len=5, jmd_n_len=20)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 4)
  jmd_n tmd jmd_c tmd_jmd
entry        
CAPSID_1 MVTHNVK...RRSYSSA KEVLEIP...RIPATKL DEENV MVTHNVK...KLDEENV
CAPSID_2 MKKRQKK...TDTSFQD FVSAEQV...EAVINAR HFGEE MKKRQKK...ARHFGEE
CAPSID_3 MRYGGSV...VSLHMAF VEARSAR...GAPDKQE VELVD MRYGGSV...QEVELVD

A JMD length of 0 is indicated by ‘…’:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'jmd_n_tmd_n'], jmd_n_len=0)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 2)
  jmd_n jmd_n_tmd_n
entry    
CAPSID_1 MVTHNVK...DDGSFTN
CAPSID_2 MKKRQKK...MMLDLMT
CAPSID_3 MRYGGSV...EHHNVRY

To select all possible parts, set all_parts=True:

df_parts = sf.get_df_parts(df_seq=df_seq, all_parts=True)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 8)
  tmd tmd_n tmd_c jmd_n jmd_c tmd_jmd jmd_n_tmd_n tmd_c_jmd_c
entry                
CAPSID_1 HVTRRSY...DDDTPRI HVTRRSY...FTNPTVT ARHGDNN...DDDTPRI MVTHNVKINK PATKLDEENV MVTHNVK...KLDEENV MVTHNVK...FTNPTVT ARHGDNN...KLDEENV
CAPSID_2 SNFTDTS...RMAMLEA SNFTDTS...MTIHEEF GHFDGLS...RMAMLEA MKKRQKKMTL VINARHFGEE MKKRQKK...ARHFGEE MKKRQKK...MTIHEEF GHFDGLS...ARHFGEE
CAPSID_3 ELEVSLH...KSPGSGA ELEVSLH...YSAAALS AAAELSA...KSPGSGA MRYGGSVISQ PDKQEVELVD MRYGGSV...QEVELVD MRYGGSV...YSAAALS AAAELSA...QEVELVD
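
The eight part names above appear to be combinations of the three base parts jmd_n, tmd, and jmd_c. The sketch below shows one way such parts could be composed. The helper compose_parts is hypothetical, and the split point of the TMD (extra residue of an odd-length TMD assigned to tmd_n) is an assumption, not confirmed library behavior.

```python
from __future__ import annotations


def compose_parts(jmd_n: str, tmd: str, jmd_c: str) -> dict[str, str]:
    """Build composite sequence parts from the three base parts.

    Hypothetical helper; not an aaanalysis function.
    """
    # ASSUMPTION: the TMD is split at its midpoint and an odd-length TMD
    # gives its extra residue to tmd_n; the library may split differently.
    half = (len(tmd) + 1) // 2
    tmd_n, tmd_c = tmd[:half], tmd[half:]
    return {
        "tmd": tmd,
        "tmd_n": tmd_n,
        "tmd_c": tmd_c,
        "jmd_n": jmd_n,
        "jmd_c": jmd_c,
        "tmd_jmd": jmd_n + tmd + jmd_c,
        "jmd_n_tmd_n": jmd_n + tmd_n,
        "tmd_c_jmd_c": tmd_c + jmd_c,
    }


parts = compose_parts("FAEDVGSNKG", "AIIGLMVGGVVIATVIVITLVML", "KKKQYTSIHH")
print(parts["tmd_jmd"])
```

Regardless of where the TMD is split, tmd_n followed by tmd_c gives back the full tmd, and tmd_jmd spans all three base parts.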

Entries with sequence gaps can be removed by setting remove_entries_with_gaps=True:

n_total = len(df_parts)
# Disable warning that entries have been removed and re-instantiate SequenceFeature
aa.options["verbose"] = False
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq, remove_entries_with_gaps=True)
n_removed = n_total - len(df_parts)
print(f"{n_removed} sequences with gaps were removed")
5 sequences with gaps were removed
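
Conceptually, this filter drops every entry in which at least one part contains the gap symbol (‘-’). A plain-Python sketch on toy data follows. The helper name and the example entries are hypothetical, and it operates on dictionaries rather than on a DataFrame.

```python
from __future__ import annotations

GAP = "-"  # gap symbol as named in the parameter description


def drop_entries_with_gaps(parts_by_entry: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Keep only entries whose parts are all free of the gap symbol."""
    return {
        entry: parts
        for entry, parts in parts_by_entry.items()
        if not any(GAP in part for part in parts.values())
    }


# Toy data (hypothetical): entry_b has a gap in jmd_n.
toy_parts = {
    "entry_a": {"jmd_n": "MKT", "tmd": "LLVAI"},
    "entry_b": {"jmd_n": "--M", "tmd": "LLVAI"},
}
kept = drop_entries_with_gaps(toy_parts)
print(f"{len(toy_parts) - len(kept)} of {len(toy_parts)} toy entries contain gaps")
```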

SequenceFeature().get_df_parts() accepts several df_seq formats. Four of them are demonstrated below using the DOM_GSEC dataset of domain-level γ-secretase substrates (see [Breimann25a]). In addition to the common ‘entry’, ‘sequence’, and ‘label’ columns, this dataset provides the TMD start and stop positions (‘tmd_start’, ‘tmd_stop’) and the default sequence parts (‘jmd_n’, ‘tmd’, ‘jmd_c’):

df_seq = aa.load_dataset(name="DOM_GSEC")
aa.display_df(df_seq, n_rows=5)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 P05067 MLPGLALLLLAAWTA...GYENPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
2 P14925 MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
3 P70180 MRSLLLFTFSACVLL...RELREDSIRSHFSVA 1 477 499 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER
4 Q03157 MGPTSPAARGQGRRW...HGYENPTYRFLEERP 1 585 607 APSGTGVSRE ALSGLLIMGAGGGSLIVLSLLLL RKKKPYGTIS
5 Q06481 MAATGTAAAAATGRL...GYENPTYKYLEQMQI 1 694 716 LREDFSLSSS ALIGLLVIAVAIATVIVISLVML RKRQYGTISH
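
Since the df_seq formats are differentiated by their columns, the distinction can be sketched as a simple column check. The function detect_format and its precedence order are assumptions made for illustration; how aaanalysis resolves a DataFrame that matches several formats is not stated here.

```python
def detect_format(columns) -> str:
    """Guess the df_seq format from its column names.

    Hypothetical helper; the precedence order below is an assumption.
    """
    cols = set(columns)
    if {"sequence", "tmd_start", "tmd_stop"} <= cols:
        return "position-based"
    if {"jmd_n", "tmd", "jmd_c"} <= cols:
        return "part-based"
    if {"sequence", "tmd"} <= cols:
        return "sequence-tmd-based"
    if {"sequence", "pos"} <= cols:
        return "anchor-based"
    if "sequence" in cols:
        return "sequence-based"
    raise ValueError("columns do not match any known df_seq format")


print(detect_format(["entry", "sequence", "tmd"]))  # sequence-tmd-based
```

Selecting a column subset, as in the examples that follow, is what determines which format is used.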
  1. Position-based format:

cols_position_based = ["entry", "sequence", "tmd_start", "tmd_stop"]
# Copy the column subset so that later assignments do not modify a view of df_seq
df_seq_position_based = df_seq[cols_position_based].copy()
list_parts = ["jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq_position_based, list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)
  jmd_n tmd jmd_c
entry      
P05067 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
P14925 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
P70180 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER
# tmd_start and tmd_stop are 1-based, so sequences are extracted as in standard biological
# annotation formats (e.g., UniProt). This allows direct use of annotated positions without conversion.
df_seq_position_based["tmd_start"] = 1
df_seq_position_based["tmd_stop"] = 5
print("Input sequence dataset with adjusted tmd_start and tmd_stop positions")
aa.display_df(df_seq_position_based, n_rows=3)

print("df_parts with TMD starting at position 1 and ending at position 5")
df_parts = sf.get_df_parts(df_seq=df_seq_position_based, list_parts=["tmd"])
aa.display_df(df_parts, n_rows=3)
Input sequence dataset with adjusted tmd_start and tmd_stop positions
  entry sequence tmd_start tmd_stop
1 P05067 MLPGLALLLLAAWTA...GYENPTYKFFEQMQN 1 5
2 P14925 MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS 1 5
3 P70180 MRSLLLFTFSACVLL...RELREDSIRSHFSVA 1 5
df_parts with TMD starting at position 1 and ending at position 5
  tmd
entry  
P05067 MLPGL
P14925 MAGRA
P70180 MRSLL
  2. Part-based format:

cols_part_based = ["entry", "jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_part_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)
  jmd_n tmd jmd_c
entry      
P05067 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
P14925 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
P70180 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER
  3. Sequence-TMD-based format:

cols_sequence_tmd_based = ["entry", "sequence", "tmd"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_tmd_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)
  jmd_n tmd jmd_c
entry      
P05067 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
P14925 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
P70180 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER
  4. Sequence-based format:

cols_sequence_based = ["entry", "sequence"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_based], list_parts=list_parts)
# Only providing the sequence will create flanking jmd_n and jmd_c regions defined by
# their length and the tmd as the remaining middle part
aa.display_df(df_parts, n_rows=3)
  jmd_n tmd jmd_c
entry      
P05067 MLPGLALLLL AAWTARALEVPTDGN...ERHLSKMQQNGYENP TYKFFEQMQN
P14925 MAGRARSGLL LLLLGLLALQSSCLA...EKDEDDGTESEEEYS APLPKPAPSS
P70180 MRSLLLFTFS ACVLLARVLLAGGAS...QQEESNIGKHRELRE DSIRSHFSVA
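
The sequence-based split described in the comment above (flanking regions by length, TMD as the remaining middle) can be sketched in plain Python. The helper split_by_length is hypothetical, the example sequence is synthetic, and sequences shorter than the two JMD lengths combined are rejected rather than padded with gaps.

```python
from __future__ import annotations


def split_by_length(sequence: str, jmd_n_len: int = 10, jmd_c_len: int = 10) -> dict[str, str]:
    """Split a sequence into jmd_n, tmd, and jmd_c using only the JMD lengths.

    Hypothetical helper; not an aaanalysis function.
    """
    if jmd_n_len < 0 or jmd_c_len < 0:
        raise ValueError("JMD lengths must be non-negative")
    if jmd_n_len + jmd_c_len > len(sequence):
        raise ValueError("sequence is too short for the requested JMD lengths")
    # Compute the end of the TMD explicitly; sequence[-0:] would return everything.
    tmd_stop = len(sequence) - jmd_c_len
    return {
        "jmd_n": sequence[:jmd_n_len],
        "tmd": sequence[jmd_n_len:tmd_stop],
        "jmd_c": sequence[tmd_stop:],
    }


toy_sequence = "A" * 10 + "L" * 5 + "K" * 10  # synthetic example
print(split_by_length(toy_sequence))
```

With the default lengths of 10, the first and last ten residues become jmd_n and jmd_c, which is why the tmd column in the output above covers most of each protein.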