read_fasta

class read_fasta(file_path, col_id='entry', col_seq='sequence', sep='|', col_db=None, cols_info=None)

Read a FASTA file into a DataFrame.

Parses a FASTA file by extracting the identifier and any further information from each header, together with the sequence that follows it.

Added in version 1.0.0.

Parameters:

file_path (str) – Path to the FASTA file.
col_id (str, default='entry') – Column name for the sequence identifiers in the resulting DataFrame.
col_seq (str, default='sequence') – Column name for the sequences in the resulting DataFrame.
sep (str, default='|') – Separator used for splitting identifier and additional information in the FASTA headers.
col_db (str, optional) – Column name for the database abbreviation. If given, the first field of each FASTA header is treated as the database and stored in this column.
cols_info (list of str, optional) – Specifies custom column names for the additional info extracted from headers. If not provided, defaults to ‘info1’, ‘info2’, etc.

Returns:

df_seq – A DataFrame in which each row corresponds to one sequence entry from the FASTA file.

Return type:

pandas.DataFrame

Notes

Each FASTA file entry consists of two parts:

FASTA header: Starting with ‘>’, the header contains the main id and additional information, all separated by a specified separator.
Sequence: The sequence of the entry, directly following the header.

df_seq includes at least these columns:

‘entry’: Protein identifier, either the UniProt accession number or an id based on index.
‘sequence’: Amino acid sequence.
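
The two-part entry structure described above can be illustrated with a minimal, standard-library sketch. It is not the implementation used by read_fasta(); the helper name parse_fasta_text and the toy entries are hypothetical and serve only to show how headers and sequences pair up.

```python
def parse_fasta_text(text):
    """Split FASTA-formatted text into (header, sequence) pairs.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there is one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (made up for this sketch), with one sequence wrapped over two lines
example_text = """>P1|first toy entry
MKTAYIAK
QRQISFVK
>P2|second toy entry
GSHMLEDP
"""
records = parse_fasta_text(example_text)
for header, sequence in records:
    print(header, sequence)
```

Each header is kept as a single string here; splitting it into an identifier and additional fields is a separate step controlled by sep.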

See also

to_fasta(): the respective FASTA saving function.
Further information and examples on the FASTA format are available in the BioPerl documentation.
FASTA files can also be read with the BioPython SeqIO module, which supports various file formats in computational biology.

Examples

You can read FASTA files using the read_fasta() function:

import aaanalysis as aa
file_path = "data/example_FASTA.fasta"
df_seq = aa.read_fasta(file_path)
aa.display_df(df_seq, n_rows=4)

	entry	sequence
1	SEMA4A,38.4	LAAQQSYWPHFVTVT...IILVASPLRALRARG
2	SEMA4B,47.0	WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
3	SEMA4C,86.6	EARAPLENLGLVWLA...LLLVLSLRRRLREEL
4	SEMA4D,19.1	TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL

To adjust the names of the columns for the primary FASTA file information, use the col_id and col_seq parameters:

df_seq = aa.read_fasta(file_path, col_id="ENTRY", col_seq="SEQUENCE")
aa.display_df(df_seq, n_rows=4)

	ENTRY	SEQUENCE
1	SEMA4A,38.4	LAAQQSYWPHFVTVT...IILVASPLRALRARG
2	SEMA4B,47.0	WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
3	SEMA4C,86.6	EARAPLENLGLVWLA...LLLVLSLRRRLREEL
4	SEMA4D,19.1	TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL

The col_id column should contain only the unique identifier. If the FASTA headers contain additional information, use the sep argument (default=‘|’) to split it into additional columns, named info1 to info(n):

df_seq = aa.read_fasta(file_path, sep=",")
aa.display_df(df_seq, n_rows=4)

	entry	sequence	info1
1	SEMA4A	LAAQQSYWPHFVTVT...IILVASPLRALRARG	38.4
2	SEMA4B	WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL	47.0
3	SEMA4C	EARAPLENLGLVWLA...LLLVLSLRRRLREEL	86.6
4	SEMA4D	TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL	19.1

To adjust the names of the additional columns, pass a list of column names via the cols_info parameter:

df_seq = aa.read_fasta(file_path, sep=",", cols_info=["prediction"])
aa.display_df(df_seq, n_rows=4)

	entry	sequence	prediction
1	SEMA4A	LAAQQSYWPHFVTVT...IILVASPLRALRARG	38.4
2	SEMA4B	WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL	47.0
3	SEMA4C	EARAPLENLGLVWLA...LLLVLSLRRRLREEL	86.6
4	SEMA4D	TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL	19.1
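
The interplay of sep and cols_info can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: the helper split_header is hypothetical, and it keeps all values as strings, whereas read_fasta() may treat them differently.

```python
def split_header(header, sep="|", cols_info=None):
    """Split a FASTA header into an identifier and named info fields.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    parts = [part.strip() for part in header.split(sep)]
    entry, info = parts[0], parts[1:]
    if cols_info is None:
        # Default naming scheme described above: info1, info2, ...
        names = [f"info{i}" for i in range(1, len(info) + 1)]
    else:
        names = list(cols_info)
    if len(names) != len(info):
        raise ValueError("cols_info must name every additional field")
    row = {"entry": entry}
    row.update(zip(names, info))
    return row


# With the default separator, the whole header stays in the id column
print(split_header("SEMA4A,38.4"))
# With sep=",", the second field moves to its own column
print(split_header("SEMA4A,38.4", sep=","))
print(split_header("SEMA4A,38.4", sep=",", cols_info=["prediction"]))
```

Raising an error on a length mismatch is a design choice of this sketch; how read_fasta() reacts to a cols_info list of the wrong length is not stated here.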

FASTA headers can start with a database abbreviation (e.g., ‘sp’ for Swiss-Prot). To store this abbreviation in a separate database column, pass a column name via the col_db parameter:

file_path = "data/example_FASTA_db.fasta"
df_seq = aa.read_fasta(file_path, col_db="database", sep=",")
aa.display_df(df_seq, n_rows=4)

	entry	sequence	database	info1
1	SEMA4A	LAAQQSYWPHFVTVT...IILVASPLRALRARG	sp	38.4
2	SEMA4B	WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL	sp	47.0
3	SEMA4C	EARAPLENLGLVWLA...LLLVLSLRRRLREEL	sp	86.6
4	SEMA4D	TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL	sp	19.1
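
How a leading database field shifts the remaining header fields can be sketched as below. The header layout "sp,SEMA4A,38.4" is an assumption (the content of example_FASTA_db.fasta is not shown here), and split_header_with_db is a hypothetical helper, not the aaanalysis implementation.

```python
def split_header_with_db(header, sep="|", col_db=None):
    """Split a header whose first field may be a database abbreviation.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    parts = header.split(sep)
    row = {}
    if col_db is not None:
        if len(parts) < 2:
            raise ValueError("header has no identifier after the database field")
        # First field is taken as the database abbreviation
        row[col_db] = parts[0]
        parts = parts[1:]
    # Next field is the identifier; the rest become info1, info2, ...
    row["entry"] = parts[0]
    for i, value in enumerate(parts[1:], start=1):
        row[f"info{i}"] = value
    return row


# Assumed header layout: database, identifier, additional value
print(split_header_with_db("sp,SEMA4A,38.4", sep=",", col_db="database"))
# Without col_db, the abbreviation would be read as the identifier
print(split_header_with_db("sp,SEMA4A,38.4", sep=","))
```

The second call shows why col_db matters in this sketch: without it, ‘sp’ ends up in the identifier column and every other field shifts by one.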