aaanalysis.to_fasta

class aaanalysis.to_fasta(df_seq=None, file_path=None, col_id='entry', col_seq='sequence', sep='|', col_db=None, cols_info=None)

Write sequence DataFrame to a FASTA file.

Saves a DataFrame to a FASTA file, writing the sequence identifiers, the sequences themselves, and any additional selected information.

Added in version 1.0.0.

Parameters:
  • df_seq (pd.DataFrame) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences, plus any additional columns.

  • file_path (str) – Path where the FASTA file will be saved.

  • col_id (str, default='entry') – Column name in df_seq for the sequence identifiers.

  • col_seq (str, default='sequence') – Column name in df_seq for the sequences.

  • sep (str, default='|') – Separator used to divide different pieces of information in the FASTA header.

  • col_db (str, optional) – Column name in df_seq for the database source of the sequence.

  • cols_info (list of str, optional) – List of column names for additional information to include in the FASTA header.

Notes

The FASTA header for each sequence is composed as follows: >[col_db]|[col_id]|[info1]|…|[infoN], where the [col_db] field is included only if col_db is given and the fields are joined by sep ('|' by default). The sequence follows on the next line.
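The header layout described above can be illustrated with a small standalone sketch. This is not the library's implementation: build_header is a hypothetical helper, the row is a plain dict standing in for one DataFrame row, and writing each info field as its bare value is an assumption (the library may format these fields differently).

```python
def build_header(row, col_id="entry", sep="|", col_db=None, cols_info=None):
    """Compose a FASTA header line from one row (hypothetical helper)."""
    parts = []
    # Optional database field comes first
    if col_db is not None:
        parts.append(str(row[col_db]))
    # The identifier is always present
    parts.append(str(row[col_id]))
    # Optional additional fields, in the order given (assumed: bare values)
    for col in cols_info or []:
        parts.append(str(row[col]))
    return ">" + sep.join(parts)


# One row of the example dataset, with an added database column
row = {"entry": "AMYLO_1", "sequence": "AAAQAA", "label": 0, "database": "sp"}

header_min = build_header(row)
header_full = build_header(row, col_db="database", cols_info=["label"])
header_comma = build_header(row, sep=",", col_db="database", cols_info=["label"])
```

Under these assumptions, header_min contains only the identifier, while header_full places the database field before it and the label after it.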

Examples

You can save a sequence DataFrame (df_seq) as a FASTA file using the to_fasta() function:

import aaanalysis as aa
# Load example dataset
df_seq = aa.load_dataset(name="SEQ_AMYLO", n=10)
aa.display_df(df_seq)

# Save as fasta
file_path = "data/example_FASTA_save.fasta"
aa.to_fasta(df_seq=df_seq, file_path=file_path)

    entry      sequence  label
1   AMYLO_1    AAAQAA    0
2   AMYLO_2    QSSYSS    0
3   AMYLO_3    QSYGQQ    0
4   AMYLO_4    QSYNPP    0
5   AMYLO_5    QSYSGY    0
6   AMYLO_6    QTDARN    0
7   AMYLO_7    QTEEKK    0
8   AMYLO_8    QTFNLF    0
9   AMYLO_9    QTNLYG    0
10  AMYLO_10   QVGFGN    0
11  AMYLO_904  NTVIIE    1
12  AMYLO_905  NQQNQY    1
13  AMYLO_906  NQNNFV    1
14  AMYLO_907  NQIVYK    1
15  AMYLO_908  NINKSN    1
16  AMYLO_909  NQFNLM    1
17  AMYLO_910  NNSGPN    1
18  AMYLO_911  NNQQNY    1
19  AMYLO_912  NKGAII    1
20  AMYLO_913  NNNWSL    1

The col_id and col_seq parameters must match the names of the identifier and sequence columns in the DataFrame:

# Save DataFrame with specified columns
aa.to_fasta(df_seq=df_seq, file_path=file_path, col_seq="sequence", col_id="entry")

Change the separator of the FASTA headers using sep:

aa.to_fasta(df_seq=df_seq, file_path=file_path, sep=",")

Specify the database column using the col_db parameter:

# Add database (Swiss-Prot) to df_seq
df_seq["database"] = "sp"
# Save FASTA with database information in header
aa.to_fasta(df_seq=df_seq, file_path=file_path, col_db="database")

To include additional information in the header, pass the names of further DataFrame columns via the cols_info parameter:

aa.to_fasta(df_seq=df_seq, file_path=file_path, cols_info=["label"])
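To check what a saved file contains, the headers can be split back into their fields. The following is a minimal sketch using only the standard library. read_fasta_records is a hypothetical helper, not part of aaanalysis, and the sample file content is hand-written to follow the header layout from the Notes rather than taken from actual to_fasta() output.

```python
import os
import tempfile


def read_fasta_records(file_path, sep="|"):
    """Return a list of (header_fields, sequence) tuples (hypothetical helper)."""
    records = []
    fields, seq_lines = None, []
    with open(file_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any
                if fields is not None:
                    records.append((fields, "".join(seq_lines)))
                fields, seq_lines = line[1:].split(sep), []
            else:
                seq_lines.append(line)
    # Close the last record
    if fields is not None:
        records.append((fields, "".join(seq_lines)))
    return records


# Hand-written sample (assumed layout: database|entry|label)
sample = ">sp|AMYLO_1|0\nAAAQAA\n>sp|AMYLO_2|0\nQSSYSS\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.fasta")
    with open(path, "w") as f:
        f.write(sample)
    records = read_fasta_records(path, sep="|")
```

The same sep value used when writing must be passed when splitting, which is worth keeping in mind when choosing a separator that could also occur inside an identifier or info field.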