to_fasta

class to_fasta(df_seq, file_path, col_id='entry', col_seq='sequence', sep='|', col_db=None, cols_info=None)

Write sequence DataFrame to a FASTA file.

Saves a DataFrame to a FASTA file that includes the sequence identifiers, the sequences themselves, and any additional selected information.

Added in version 1.0.0.

Parameters:

df_seq (pd.DataFrame) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences, plus any additional columns.
file_path (str) – Path where the FASTA file will be saved.
col_id (str, default='entry') – Column name in df_seq for the sequence identifiers.
col_seq (str, default='sequence') – Column name in df_seq for the sequences.
sep (str, default='|') – Separator used to divide different pieces of information in the FASTA header.
col_db (str, optional) – Column name in df_seq for the database source of the sequence.
cols_info (list of str, optional) – List of column names for additional information to include in the FASTA header.

Return type:

None

Notes

The FASTA header for each sequence is composed as follows: >[col_db]|[col_id]|[info1]|…|[infoN], where [col_db] is included only if col_db is given and the fields are joined by sep. The sequence follows on the next line.
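The header layout described above can be sketched in plain Python. This is only an illustration of the documented format, not the library's implementation; build_header is a hypothetical helper, and the row values are taken from the example dataset shown further down.

```python
def build_header(row, col_id="entry", sep="|", col_db=None, cols_info=None):
    """Compose a FASTA header from a dict-like row (hypothetical helper)."""
    parts = []
    if col_db is not None:
        parts.append(str(row[col_db]))   # optional database prefix
    parts.append(str(row[col_id]))       # sequence identifier
    for col in cols_info or []:
        parts.append(str(row[col]))      # additional information fields
    return ">" + sep.join(parts)


# Illustrative row; "database" is an assumed extra column
row = {"entry": "AMYLO_1", "sequence": "AAAQAA", "label": 0, "database": "sp"}

header_default = build_header(row)
header_full = build_header(row, col_db="database", cols_info=["label"])
print(header_default)  # >AMYLO_1
print(header_full)     # >sp|AMYLO_1|0
```

With the default arguments only the identifier appears in the header; the database prefix and extra fields are added only when col_db and cols_info are set.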

See also

read_fasta(): the respective FASTA reading function.
Further information and examples on the FASTA format are available in the BioPerl documentation.
FASTA files can be used to create a BioPython SeqIO object, which supports various file formats in computational biology.

Examples

You can save a sequence DataFrame (‘df_seq’) as a FASTA file using the to_fasta() function:

import aaanalysis as aa
# Load example dataset
df_seq = aa.load_dataset(name="SEQ_AMYLO", n=10)
aa.display_df(df_seq)

# Save as fasta
file_path = "data/example_FASTA_save.fasta"
aa.to_fasta(df_seq=df_seq, file_path=file_path)

	entry	sequence	label
1	AMYLO_1	AAAQAA	0
2	AMYLO_2	QSSYSS	0
3	AMYLO_3	QSYGQQ	0
4	AMYLO_4	QSYNPP	0
5	AMYLO_5	QSYSGY	0
6	AMYLO_6	QTDARN	0
7	AMYLO_7	QTEEKK	0
8	AMYLO_8	QTFNLF	0
9	AMYLO_9	QTNLYG	0
10	AMYLO_10	QVGFGN	0
11	AMYLO_904	NTVIIE	1
12	AMYLO_905	NQQNQY	1
13	AMYLO_906	NQNNFV	1
14	AMYLO_907	NQIVYK	1
15	AMYLO_908	NINKSN	1
16	AMYLO_909	NQFNLM	1
17	AMYLO_910	NNSGPN	1
18	AMYLO_911	NNQQNY	1
19	AMYLO_912	NKGAII	1
20	AMYLO_913	NNNWSL	1

The names of the entry and sequence columns in the DataFrame must match the col_id and col_seq parameters:

# Save DataFrame with specified columns
aa.to_fasta(df_seq=df_seq, file_path=file_path, col_seq="sequence", col_id="entry")

Change the separator of the FASTA headers using sep:

aa.to_fasta(df_seq=df_seq, file_path=file_path, sep=",")

Specify the database column using the col_db parameter:

# Add database (Swiss-Prot) to df_seq
df_seq["database"] = "sp"
# Save FASTA with database information in header
aa.to_fasta(df_seq=df_seq, file_path=file_path, col_db="database")

To save additional information in the header, pass other DataFrame column names via the cols_info parameter:

aa.to_fasta(df_seq=df_seq, file_path=file_path, cols_info=["label"])
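To check what ends up in the headers, the saved text can be split back into fields with plain Python. The following is a minimal sketch under stated assumptions: parse_fasta_text is a hypothetical helper, and example_text is hand-written to follow the header layout from the Notes (as if cols_info=["label"] had been used); it is not actual output of to_fasta().

```python
def parse_fasta_text(text, sep="|"):
    """Split FASTA text into (header_fields, sequence) tuples (hypothetical helper)."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header = line[1:].split(sep)
            seq_lines = []
        else:
            seq_lines.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


# Hand-written example following the documented layout (assumption)
example_text = ">AMYLO_1|0\nAAAQAA\n>AMYLO_2|0\nQSSYSS\n"
records = parse_fasta_text(example_text)
for fields, seq in records:
    print(fields, seq)
```

The sep argument passed when parsing has to be the same one used when writing; otherwise the header stays a single unsplit field.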