aaanalysis.read_fasta
- aaanalysis.read_fasta(file_path, col_id='entry', col_seq='sequence', sep='|', col_db=None, cols_info=None)
Read a FASTA file into a DataFrame.
Parses a FASTA file by extracting the identifier and any additional information from each header, together with the sequence that follows it.
Added in version 1.0.0.
- Parameters:
file_path (str) – Path to the FASTA file.
col_id (str, default='entry') – Column name for the sequence identifiers in the resulting DataFrame.
col_seq (str, default='sequence') – Column name for the sequences in the resulting DataFrame.
sep (str, default='|') – Separator used for splitting identifier and additional information in the FASTA headers.
col_db (str, optional) – Column name for databases. First entry of FASTA header if given.
cols_info (list of str, optional) – Specifies custom column names for the additional info extracted from headers. If not provided, defaults to ‘info1’, ‘info2’, etc.
- Returns:
df_seq – DataFrame in which each row corresponds to one sequence entry from the FASTA file.
- Return type:
pandas.DataFrame
Notes
Each FASTA file entry consists of two parts:
FASTA header: Starting with ‘>’, the header contains the main id and additional information, all separated by a specified separator.
Sequence: Sequence of the specific entry, directly following the header.
df_seq includes at least these columns:
‘entry’: Protein identifier, either the UniProt accession number or an id based on index.
‘sequence’: Amino acid sequence.
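The two-part structure described above can be illustrated with a small standard-library sketch. It is not the implementation used by read_fasta(); the helper names parse_fasta_text and _make_record, the identifiers P1 and P2, and the short sequences are hypothetical placeholders chosen for illustration.

```python
def _make_record(header, seq_lines, sep):
    # The first header field is treated as the identifier,
    # all remaining fields as additional information.
    fields = [field.strip() for field in header.split(sep)]
    return {"entry": fields[0], "info": fields[1:], "sequence": "".join(seq_lines)}


def parse_fasta_text(text, sep="|"):
    """Split FASTA-formatted text into records (hypothetical helper)."""
    records = []
    header, seq_lines = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append(_make_record(header, seq_lines, sep))
            header, seq_lines = line[1:], []
        elif header is not None:
            # A sequence may be wrapped over several lines.
            seq_lines.append(line)
    if header is not None:
        records.append(_make_record(header, seq_lines, sep))
    return records


# Placeholder data: identifiers, scores, and sequences are made up.
example = """>P1,0.5
MKT
LLV
>P2,0.9
GGA
"""
records = parse_fasta_text(example, sep=",")
for record in records:
    print(record)
```

Each record keeps the identifier apart from the remaining header fields, which mirrors the separation between the ‘entry’ column and the additional info columns.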
See also
to_fasta(): the respective FASTA saving function.
Further information and examples on the FASTA format can be found in the BioPerl documentation.
FASTA files can also be read with the Biopython SeqIO module, which supports various file formats in computational biology.
Examples
You can read FASTA files using the read_fasta() function:

    import aaanalysis as aa
    file_path = "data/example_FASTA.fasta"
    df_seq = aa.read_fasta(file_path)
    aa.display_df(df_seq, n_rows=4)

       entry        sequence
    1  SEMA4A,38.4  LAAQQSYWPHFVTVT...IILVASPLRALRARG
    2  SEMA4B,47.0  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
    3  SEMA4C,86.6  EARAPLENLGLVWLA...LLLVLSLRRRLREEL
    4  SEMA4D,19.1  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL

To adjust the names of the columns for the primary FASTA file information, use the col_id and col_seq parameters:

    df_seq = aa.read_fasta(file_path, col_id="ENTRY", col_seq="SEQUENCE")
    aa.display_df(df_seq, n_rows=4)

       ENTRY        SEQUENCE
    1  SEMA4A,38.4  LAAQQSYWPHFVTVT...IILVASPLRALRARG
    2  SEMA4B,47.0  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
    3  SEMA4C,86.6  EARAPLENLGLVWLA...LLLVLSLRRRLREEL
    4  SEMA4D,19.1  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL

The col_id column should only contain the unique identifier. If the FASTA file comprises additional information, use the sep (default=‘|’) argument to save them in additional columns, named info1 to info(n):

    df_seq = aa.read_fasta(file_path, sep=",")
    aa.display_df(df_seq, n_rows=4)

       entry   sequence                           info1
    1  SEMA4A  LAAQQSYWPHFVTVT...IILVASPLRALRARG  38.4
    2  SEMA4B  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL  47.0
    3  SEMA4C  EARAPLENLGLVWLA...LLLVLSLRRRLREEL  86.6
    4  SEMA4D  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL  19.1

To adjust the name of the additional columns, provide a list of column names by cols_info:

    df_seq = aa.read_fasta(file_path, sep=",", cols_info=["prediction"])
    aa.display_df(df_seq, n_rows=4)

       entry   sequence                           prediction
    1  SEMA4A  LAAQQSYWPHFVTVT...IILVASPLRALRARG  38.4
    2  SEMA4B  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL  47.0
    3  SEMA4C  EARAPLENLGLVWLA...LLLVLSLRRRLREEL  86.6
    4  SEMA4D  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL  19.1

The headers of FASTA files can start with a database abbreviation (e.g., ‘sp’ for Swiss-Prot). To properly convert these into a database column, provide a name to the col_db parameter:

    file_path = "data/example_FASTA_db.fasta"
    df_seq = aa.read_fasta(file_path, col_db="database", sep=",")
    aa.display_df(df_seq, n_rows=4)

       entry   sequence                           database  info1
    1  SEMA4A  LAAQQSYWPHFVTVT...IILVASPLRALRARG  sp        38.4
    2  SEMA4B  WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL  sp        47.0
    3  SEMA4C  EARAPLENLGLVWLA...LLLVLSLRRRLREEL  sp        86.6
    4  SEMA4D  TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL  sp        19.1
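
How the sep, col_db, and cols_info parameters interact when header fields are mapped to column names can be sketched in plain Python. This is an illustration of the documented behavior, not the library's own code. The helper name_header_fields is hypothetical, and the header layout "sp,SEMA4A,38.4" is an assumption inferred from the output above, since the content of the example files is not shown here.

```python
def name_header_fields(header, sep="|", col_id="entry", col_db=None, cols_info=None):
    """Map the fields of one FASTA header (without '>') to column names.

    Hypothetical helper; assumes the header holds enough fields
    for the database (if requested) and the identifier.
    """
    fields = [field.strip() for field in header.split(sep)]
    row = {}
    if col_db is not None:
        # Assumption: the database abbreviation is the first header field.
        row[col_db] = fields.pop(0)
    row[col_id] = fields.pop(0)
    # Remaining fields get the custom names if given, otherwise info1, info2, ...
    names = list(cols_info) if cols_info is not None else []
    for i, value in enumerate(fields):
        name = names[i] if i < len(names) else f"info{i + 1}"
        row[name] = value
    return row


row_default = name_header_fields("SEMA4A,38.4", sep=",")
row_named = name_header_fields("SEMA4A,38.4", sep=",", cols_info=["prediction"])
row_db = name_header_fields("sp,SEMA4A,38.4", sep=",", col_db="database")
print(row_default)
print(row_named)
print(row_db)
```

With the default sep='|', the same header would not be split at all, which corresponds to the first example above, where the whole string "SEMA4A,38.4" ends up in the entry column.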