aaanalysis.SequenceFeature.get_df_parts
- SequenceFeature.get_df_parts(df_seq=None, list_parts=None, all_parts=False, jmd_n_len=10, jmd_c_len=10, tmd_len=None, remove_entries_with_gaps=False, replace_non_canonical_aa=False)
Create a DataFrame with selected sequence parts.
Changed in version 1.1.0: Added the pos-anchor input mode (tmd_len).
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and sequence information in a distinct format: Position-based, Part-based, Sequence-based, or Sequence-TMD-based.
list_parts (list of str, default={tmd, jmd_n_tmd_n, tmd_c_jmd_c}) – Names of sequence parts that should be obtained for sequences from df_seq.
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids. If None, jmd_n and jmd_c should be given.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids. If None, jmd_n and jmd_c should be given.
tmd_len (int, optional) – TMD length in amino acids for the Anchor-based format only (a sequence + pos df_seq). Each 1-based anchor in pos is placed at the P1 position of a TMD of length tmd_len (right-heavy for even tmd_len); ignored for the other formats.
all_parts (bool, default=False) – Whether to create a DataFrame with all possible sequence parts (if True) or only the parts given by list_parts.
remove_entries_with_gaps (bool, default=False) – Whether to exclude entries containing missing residues in their sequence parts (if True), usually resulting from sequences being too short.
replace_non_canonical_aa (bool, default=False) – Whether to replace non-canonical amino acids (e.g., ‘X’) by the gap symbol (‘-’).
- Returns:
df_parts – Sequence parts DataFrame.
- Return type:
pd.DataFrame, shape (n_samples, n_parts)
See also
aaanalysis.SequenceFeature for the definition of parts and lists of all existing and default parts.
Notes
- If ext_len in aaanalysis.options is not set to > 0, the following parts containing the extended TMD are not considered for all_parts=True: [‘tmd_e’, ‘ext_c’, ‘ext_n’, ‘ext_n_tmd_n’, ‘tmd_c_ext_c’].
- jmd_n_len and jmd_c_len must both be given, except for the part-based format.
- tmd_start and tmd_stop use 1-based indexing to follow standard biological annotation conventions (e.g., UniProt), where residue positions start at 1. This allows direct use of annotated positions without conversion.
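The 1-based, inclusive convention for tmd_start and tmd_stop can be illustrated with a minimal pure-Python sketch. The helper name split_by_position is hypothetical and not part of aaanalysis. Padding too-short flanks with the gap symbol, and the side on which the padding is added, are assumptions made for illustration only.

```python
def split_by_position(sequence, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10, gap="-"):
    """Split a sequence into (jmd_n, tmd, jmd_c) using 1-based, inclusive TMD positions.

    Hypothetical helper for illustration; not the aaanalysis implementation.
    """
    if not 1 <= tmd_start <= tmd_stop <= len(sequence):
        raise ValueError("Expected 1 <= tmd_start <= tmd_stop <= len(sequence)")
    start = tmd_start - 1  # 0-based index of the first TMD residue
    stop = tmd_stop        # exclusive slice end equals the 1-based inclusive stop
    tmd = sequence[start:stop]
    jmd_n = sequence[max(0, start - jmd_n_len):start]
    jmd_c = sequence[stop:stop + jmd_c_len]
    # Assumption: flanks that run past the sequence ends are padded with gaps
    jmd_n = jmd_n.rjust(jmd_n_len, gap)
    jmd_c = jmd_c.ljust(jmd_c_len, gap)
    return jmd_n, tmd, jmd_c


# A TMD annotated at positions 1 to 5 covers the first five residues
print(split_by_position("MLPGLALLLLAAWTA", tmd_start=1, tmd_stop=5))
```

Because Python slices are 0-based and end-exclusive, only the start position needs to be shifted by one; the inclusive 1-based stop can be used directly as the slice end.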
Formats for df_seq are differentiated by their respective columns:
- Position-based format
‘sequence’: The complete amino acid sequence.
‘tmd_start’: Starting position of the TMD in the sequence (1-based, inclusive).
‘tmd_stop’: Ending position of the TMD in the sequence (1-based, inclusive).
- Part-based format
‘jmd_n’: Amino acid sequence for JMD-N.
‘tmd’: Amino acid sequence for TMD.
‘jmd_c’: Amino acid sequence for JMD-C.
- Sequence-TMD-based format
‘sequence’ and ‘tmd’ columns.
- Sequence-based format
Only the ‘sequence’ column.
- Anchor-based format
‘sequence’ and ‘pos’ columns, together with the tmd_len argument.
‘pos’: Per-row 1-based P1 anchor position(s), given as a single int or a list[int]. Each anchor is exploded into one row whose TMD is centered (right-heavy for even tmd_len) on the anchor; multi-anchor rows yield multiple rows, identified in the index by <entry>_<win_start>-<win_stop>.
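One possible reading of the anchor placement in the Anchor-based format is sketched below. The function names anchor_window and explode_anchors are hypothetical. The interpretation of "right-heavy" (one more residue after the anchor than before it when tmd_len is even) and keeping the plain entry name for single-anchor rows are assumptions. Windows that extend beyond the sequence ends are not handled.

```python
def anchor_window(pos, tmd_len):
    """Return the 1-based, inclusive (win_start, win_stop) of a TMD window around an anchor.

    Assumption: for even tmd_len the window holds one more residue after the
    anchor than before it ("right-heavy"); for odd tmd_len it is centered.
    """
    if tmd_len < 1:
        raise ValueError("tmd_len must be >= 1")
    n_left = (tmd_len - 1) // 2  # residues before the anchor
    n_right = tmd_len // 2       # residues after the anchor
    return pos - n_left, pos + n_right


def explode_anchors(entry, pos, tmd_len):
    """Expand one entry with one or several anchors into (index, win_start, win_stop) rows."""
    anchors = [pos] if isinstance(pos, int) else list(pos)
    rows = []
    for anchor in anchors:
        win_start, win_stop = anchor_window(anchor, tmd_len)
        # Assumption: only multi-anchor entries get the window suffix in the index
        index = entry if len(anchors) == 1 else f"{entry}_{win_start}-{win_stop}"
        rows.append((index, win_start, win_stop))
    return rows


print(explode_anchors("P05067", [10, 20], tmd_len=4))
```

Under this reading, an even tmd_len places the anchor directly before the window midpoint, so the window is symmetric around the bond following the P1 residue.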
Examples
To demonstrate SequenceFeature().get_df_parts(), we first obtain an example sequence dataset using the load_dataset() function:
import aaanalysis as aa
sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="SEQ_CAPSID")
aa.display_df(df_seq, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 3)
   entry     sequence           label
1  CAPSID_1  MVTHNVK...KLDEENV  0
2  CAPSID_2  MKKRQKK...ARHFGEE  0
3  CAPSID_3  MRYGGSV...QEVELVD  0
By default, three sequence parts (tmd, jmd_n_tmd_n, tmd_c_jmd_c) are provided, with jmd_n and jmd_c lengths of 10 residues each:
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df=df_parts, n_rows=5, show_shape=True, char_limit=15)
DataFrame shape: (7862, 3)
entry     tmd                jmd_n_tmd_n        tmd_c_jmd_c
CAPSID_1  HVTRRSY...DDDTPRI  MVTHNVK...FTNPTVT  ARHGDNN...KLDEENV
CAPSID_2  SNFTDTS...RMAMLEA  MKKRQKK...MTIHEEF  GHFDGLS...ARHFGEE
CAPSID_3  ELEVSLH...KSPGSGA  MRYGGSV...YSAAALS  AAAELSA...QEVELVD
CAPSID_4  VGRHRRI...KRRQALE  MERGDIP...KAEDVSK  YQRIRDE...EMDAGLI
CAPSID_5  LILFTQV...QMPSGCV  MKRIYLL...EMIKSTP  YGNTITL...ITELTHQ
Any combination of valid sequence parts can be obtained using the list_parts parameter:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'])
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 4)
entry     jmd_n       tmd                jmd_c       tmd_jmd
CAPSID_1  MVTHNVKINK  HVTRRSY...DDDTPRI  PATKLDEENV  MVTHNVK...KLDEENV
CAPSID_2  MKKRQKKMTL  SNFTDTS...RMAMLEA  VINARHFGEE  MKKRQKK...ARHFGEE
CAPSID_3  MRYGGSVISQ  ELEVSLH...KSPGSGA  PDKQEVELVD  MRYGGSV...QEVELVD
Set the length of both JMDs by the jmd_c_len and jmd_n_len parameters:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'], jmd_c_len=5, jmd_n_len=20)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 4)
entry     jmd_n              tmd                jmd_c  tmd_jmd
CAPSID_1  MVTHNVK...RRSYSSA  KEVLEIP...RIPATKL  DEENV  MVTHNVK...KLDEENV
CAPSID_2  MKKRQKK...TDTSFQD  FVSAEQV...EAVINAR  HFGEE  MKKRQKK...ARHFGEE
CAPSID_3  MRYGGSV...VSLHMAF  VEARSAR...GAPDKQE  VELVD  MRYGGSV...QEVELVD
A JMD length of 0 is indicated by ‘…’:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'jmd_n_tmd_n'], jmd_n_len=0)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 2)
entry     jmd_n  jmd_n_tmd_n
CAPSID_1         MVTHNVK...DDGSFTN
CAPSID_2         MKKRQKK...MMLDLMT
CAPSID_3         MRYGGSV...EHHNVRY
To select all possible parts, set all_parts=True:
df_parts = sf.get_df_parts(df_seq=df_seq, all_parts=True)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
DataFrame shape: (7862, 8)
entry     tmd                tmd_n              tmd_c              jmd_n       jmd_c       tmd_jmd            jmd_n_tmd_n        tmd_c_jmd_c
CAPSID_1  HVTRRSY...DDDTPRI  HVTRRSY...FTNPTVT  ARHGDNN...DDDTPRI  MVTHNVKINK  PATKLDEENV  MVTHNVK...KLDEENV  MVTHNVK...FTNPTVT  ARHGDNN...KLDEENV
CAPSID_2  SNFTDTS...RMAMLEA  SNFTDTS...MTIHEEF  GHFDGLS...RMAMLEA  MKKRQKKMTL  VINARHFGEE  MKKRQKK...ARHFGEE  MKKRQKK...MTIHEEF  GHFDGLS...ARHFGEE
CAPSID_3  ELEVSLH...KSPGSGA  ELEVSLH...YSAAALS  AAAELSA...KSPGSGA  MRYGGSVISQ  PDKQEVELVD  MRYGGSV...QEVELVD  MRYGGSV...YSAAALS  AAAELSA...QEVELVD
Entries with sequence gaps can be removed by setting remove_entries_with_gaps=True:
n_total = len(df_parts)
# Disable warning that entries have been removed and re-instantiate SequenceFeature
aa.options["verbose"] = False
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq, remove_entries_with_gaps=True)
n_removed = n_total - len(df_parts)
print(f"{n_removed} sequences with gaps were removed")
5 sequences with gaps were removed
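The effect of the replace_non_canonical_aa and remove_entries_with_gaps options can be sketched in plain Python. Both helper names are hypothetical and operate on a simple dict of parts rather than a DataFrame. Treating exactly the 20 standard one-letter codes as canonical is an assumption of this sketch, not a statement about the library's internals.

```python
# Assumption: "canonical" means the 20 standard amino acids in one-letter code
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def replace_non_canonical(seq, gap="-"):
    """Replace every residue outside the canonical alphabet by the gap symbol."""
    return "".join(aa if aa in CANONICAL_AA else gap for aa in seq)


def drop_entries_with_gaps(parts_by_entry, gap="-"):
    """Keep only entries whose sequence parts contain no gap symbol."""
    return {
        entry: parts
        for entry, parts in parts_by_entry.items()
        if not any(gap in part for part in parts.values())
    }


parts_by_entry = {
    "SHORT_1": {"jmd_n": "----MKT", "tmd": "LLLL"},  # gaps from a too-short sequence
    "FULL_1": {"jmd_n": "AAAMKT", "tmd": "LLLL"},
}
print(replace_non_canonical("MKXL"))
print(list(drop_entries_with_gaps(parts_by_entry)))
```

Note that combining both steps in this sketch would also drop entries whose only "gaps" come from replaced non-canonical residues, since both use the same symbol.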
SequenceFeature().get_df_parts() works with four different df_seq formats, which we demonstrate using the DOM_GSEC domain level γ-secretase substrates dataset (see [Breimann25a]). Next to the common ‘entry’, ‘sequence’, and ‘label’ columns, this dataset provides columns for the TMD start and stop position (‘tmd_start’, ‘tmd_stop’) and the default sequence parts ‘jmd_n’, ‘tmd’, ‘jmd_c’:
df_seq = aa.load_dataset(name="DOM_GSEC")
aa.display_df(df_seq, n_rows=5)
   entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
1  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
2  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
3  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER
4  Q03157  MGPTSPAARGQGRRW...HGYENPTYRFLEERP  1      585        607       APSGTGVSRE  ALSGLLIMGAGGGSLIVLSLLLL  RKKKPYGTIS
5  Q06481  MAATGTAAAAATGRL...GYENPTYKYLEQMQI  1      694        716       LREDFSLSSS  ALIGLLVIAVAIATVIVISLVML  RKRQYGTISH
Position-based format:
cols_position_based = ["entry", "sequence", "tmd_start", "tmd_stop"]
# Copy the selection so that later column assignments do not act on a view of df_seq
df_seq_position_based = df_seq[cols_position_based].copy()
list_parts = ["jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq_position_based, list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)
entry   jmd_n       tmd                      jmd_c
P05067  FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
P14925  KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
P70180  PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER
# tmd_start and tmd_stop are 1-based, so that the sequences are extracted as in standard biological
# annotation formats (e.g., UniProt). This allows direct use of annotated positions without conversion.
df_seq_position_based["tmd_start"] = 1
df_seq_position_based["tmd_stop"] = 5
print("Input sequence dataset with adjusted tmd_start and tmd_stop positions")
aa.display_df(df_seq_position_based, n_rows=3)
print("df_parts with TMD starting at position 1 and ending at position 5")
df_parts = sf.get_df_parts(df_seq=df_seq_position_based, list_parts=["tmd"])
aa.display_df(df_parts, n_rows=3)
Input sequence dataset with adjusted tmd_start and tmd_stop positions
   entry   sequence                           tmd_start  tmd_stop
1  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1          5
2  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1          5
3  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1          5
df_parts with TMD starting at position 1 and ending at position 5
entry   tmd
P05067  MLPGL
P14925  MAGRA
P70180  MRSLL
Part-based format:
cols_part_based = ["entry", "jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_part_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)
entry   jmd_n       tmd                      jmd_c
P05067  FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
P14925  KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
P70180  PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER
Sequence-TMD-based format:
cols_sequence_tmd_based = ["entry", "sequence", "tmd"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_tmd_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)
entry   jmd_n       tmd                      jmd_c
P05067  FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
P14925  KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
P70180  PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER
Sequence-based format:
cols_sequence_based = ["entry", "sequence"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_based], list_parts=list_parts)
# Only providing the sequence will create flanking jmd_n and jmd_c regions defined by
# their length and the tmd as the remaining middle part
aa.display_df(df_parts, n_rows=3)
entry   jmd_n       tmd                                jmd_c
P05067  MLPGLALLLL  AAWTARALEVPTDGN...ERHLSKMQQNGYENP  TYKFFEQMQN
P14925  MAGRARSGLL  LLLLGLLALQSSCLA...EKDEDDGTESEEEYS  APLPKPAPSS
P70180  MRSLLLFTFS  ACVLLARVLLAGGAS...QQEESNIGKHRELRE  DSIRSHFSVA
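The sequence-based split shown above (flanking JMDs taken from both sequence ends, with the remainder as TMD) can be sketched as follows. The helper split_sequence_based is hypothetical, and the toy sequence is made up from the first and last ten residues of P05067 shown in the output. Raising an error for sequences shorter than the two flanks is an assumption of this sketch.

```python
def split_sequence_based(sequence, jmd_n_len=10, jmd_c_len=10):
    """Split a sequence into (jmd_n, tmd, jmd_c) using only the two flank lengths.

    Hypothetical helper for illustration; not the aaanalysis implementation.
    """
    if jmd_n_len < 0 or jmd_c_len < 0:
        raise ValueError("JMD lengths must be non-negative")
    if jmd_n_len + jmd_c_len > len(sequence):
        # Assumption: too-short sequences are rejected here instead of being gap-padded
        raise ValueError("Sequence is shorter than jmd_n_len + jmd_c_len")
    tmd_stop = len(sequence) - jmd_c_len  # explicit index so jmd_c_len=0 works
    jmd_n = sequence[:jmd_n_len]
    tmd = sequence[jmd_n_len:tmd_stop]
    jmd_c = sequence[tmd_stop:]
    return jmd_n, tmd, jmd_c


# Toy sequence: N-terminal flank, short middle part, C-terminal flank
toy_sequence = "MLPGLALLLL" + "AAWTA" + "TYKFFEQMQN"
print(split_sequence_based(toy_sequence))
```

Computing the C-terminal boundary explicitly avoids the pitfall that sequence[-0:] returns the whole string when jmd_c_len is 0.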