aaanalysis.read_fasta

aaanalysis.read_fasta(file_path, col_id='entry', col_seq='sequence', sep='|', col_db=None, cols_info=None)

Read a FASTA file into a DataFrame.

Parses a FASTA file by extracting the identifier and any additional information from each header, together with the sequence that follows it.

Added in version 1.0.0.

Parameters:
  • file_path (str) – Path to the FASTA file.

  • col_id (str, default='entry') – Column name for the sequence identifiers in the resulting DataFrame.

  • col_seq (str, default='sequence') – Column name for the sequences in the resulting DataFrame.

  • sep (str, default='|') – Separator used for splitting identifier and additional information in the FASTA headers.

  • col_db (str, optional) – Column name for the database. If given, the first entry of each FASTA header is stored in this column.

  • cols_info (list of str, optional) – Specifies custom column names for the additional info extracted from headers. If not provided, defaults to ‘info1’, ‘info2’, etc.

Returns:

df_seq – DataFrame in which each row corresponds to one sequence entry from the FASTA file.

Return type:

pandas.DataFrame

Notes

Each FASTA file entry consists of two parts:

  • FASTA header: Starting with ‘>’, the header contains the main id and additional information, all separated by a specified separator.

  • Sequence: The sequence of the entry, directly following the header.
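
The two-part entry layout described above can be illustrated with a minimal, standard-library sketch. It is not the implementation used by read_fasta(); the helper name iter_fasta_records and the two-entry example text are hypothetical and serve only to show how headers and sequence lines relate.

```python
def iter_fasta_records(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Illustrative sketch only, not the aaanalysis implementation.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines belonging to the current header are concatenated
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-entry FASTA text (not taken from the example files)
example = ">P1|first\nMKT\nLLV\n>P2|second\nGAV\n"
records = list(iter_fasta_records(example))
```

Each header is returned without the leading ‘>’, so it can be split further with the chosen separator.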

df_seq includes at least these columns:

  • ‘entry’: Protein identifier, either the UniProt accession number or an id based on index.

  • ‘sequence’: Amino acid sequence.

Examples

You can read FASTA files using the read_fasta() function:

import aaanalysis as aa
file_path = "data/example_FASTA.fasta"
df_seq = aa.read_fasta(file_path)
aa.display_df(df_seq, n_rows=4)
   entry        sequence
1  SEMA4A,38.4  LAAQQSYWPHFVTVT...IILVASPLRALRARG
2  SEMA4B,47.0  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
3  SEMA4C,86.6  EARAPLENLGLVWLA...LLLVLSLRRRLREEL
4  SEMA4D,19.1  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL

To rename the identifier and sequence columns, use the col_id and col_seq parameters:

df_seq = aa.read_fasta(file_path, col_id="ENTRY", col_seq="SEQUENCE")
aa.display_df(df_seq, n_rows=4)
   ENTRY        SEQUENCE
1  SEMA4A,38.4  LAAQQSYWPHFVTVT...IILVASPLRALRARG
2  SEMA4B,47.0  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
3  SEMA4C,86.6  EARAPLENLGLVWLA...LLLVLSLRRRLREEL
4  SEMA4D,19.1  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL

The col_id column should contain only the unique identifier. If the FASTA headers contain additional information, use the sep argument (default='|') to split it off and store it in additional columns named ‘info1’, ‘info2’, etc.:

df_seq = aa.read_fasta(file_path, sep=",")
aa.display_df(df_seq, n_rows=4)
   entry   sequence                           info1
1  SEMA4A  LAAQQSYWPHFVTVT...IILVASPLRALRARG  38.4
2  SEMA4B  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL  47.0
3  SEMA4C  EARAPLENLGLVWLA...LLLVLSLRRRLREEL  86.6
4  SEMA4D  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL  19.1

To rename the additional columns, pass a list of column names via the cols_info parameter:

df_seq = aa.read_fasta(file_path, sep=",", cols_info=["prediction"])
aa.display_df(df_seq, n_rows=4)
   entry   sequence                           prediction
1  SEMA4A  LAAQQSYWPHFVTVT...IILVASPLRALRARG  38.4
2  SEMA4B  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL  47.0
3  SEMA4C  EARAPLENLGLVWLA...LLLVLSLRRRLREEL  86.6
4  SEMA4D  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL  19.1
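
How sep and cols_info map a header onto columns can be sketched in plain Python. This is a rough illustration under assumptions, not the library's code: header_to_row is a hypothetical helper, and the header string "SEMA4A,38.4" is inferred from the output shown above rather than read from the example file.

```python
def header_to_row(header, sep="|", col_id="entry", cols_info=None):
    """Split one FASTA header (without '>') into a dict of column values.

    Illustrative sketch only, not the aaanalysis implementation.
    """
    # First field is the identifier, the remaining fields are extra info
    entry_id, *info = header.split(sep)
    if cols_info is None:
        # Default names as documented: 'info1', 'info2', ...
        cols_info = [f"info{i}" for i in range(1, len(info) + 1)]
    row = {col_id: entry_id}
    # zip() pairs names with values and stops at the shorter of the two
    row.update(zip(cols_info, info))
    return row


# Header string assumed from the example output above
row_default = header_to_row("SEMA4A,38.4", sep=",")
row_named = header_to_row("SEMA4A,38.4", sep=",", cols_info=["prediction"])
```

With the default separator ‘|’ the same header would not be split, which matches the first example, where the whole string "SEMA4A,38.4" ends up in the identifier column.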

The headers of FASTA files can start with a database abbreviation (e.g., ‘sp’ for Swiss-Prot). To store this abbreviation in a separate database column, pass a column name to the col_db parameter:

file_path = "data/example_FASTA_db.fasta"
df_seq = aa.read_fasta(file_path, col_db="database", sep=",")
aa.display_df(df_seq, n_rows=4)
   entry   sequence                           database  info1
1  SEMA4A  LAAQQSYWPHFVTVT...IILVASPLRALRARG  sp        38.4
2  SEMA4B  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL  sp        47.0
3  SEMA4C  EARAPLENLGLVWLA...LLLVLSLRRRLREEL  sp        86.6
4  SEMA4D  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL  sp        19.1
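
The effect of a leading database field can be sketched as follows. This is only an approximation of the documented behaviour (the first header entry is treated as the database when col_db is given); split_db_header and its has_db flag are hypothetical, and the header string "sp,SEMA4A,38.4" is an assumption based on the output above, since the file contents are not shown here.

```python
def split_db_header(header, sep="|", has_db=False):
    """Return (database, entry_id, info) for one FASTA header without '>'.

    Illustrative sketch only, not the aaanalysis implementation.
    """
    fields = header.split(sep)
    # With a database column requested, the first field is the database
    database = fields.pop(0) if has_db else None
    entry_id, info = fields[0], fields[1:]
    return database, entry_id, info


# Header string assumed from the example output above
with_db = split_db_header("sp,SEMA4A,38.4", sep=",", has_db=True)
without_db = split_db_header("sp,SEMA4A,38.4", sep=",")
```

Without the database flag, the sketch treats ‘sp’ as the identifier and shifts the remaining fields into the info columns, which is why naming the database column matters for headers of this form.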