aaanalysis.to_fasta
- class aaanalysis.to_fasta(df_seq=None, file_path=None, col_id='entry', col_seq='sequence', sep='|', col_db=None, cols_info=None)
Write sequence DataFrame to a FASTA file.
Saves a DataFrame to a FASTA file that includes the sequence identifiers, the sequences themselves, and any additional selected information.
Added in version 1.0.0.
- Parameters:
df_seq (pd.DataFrame) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences, plus any additional columns.
file_path (str) – Path where the FASTA file will be saved.
col_id (str, default='entry') – Column name in df_seq for the sequence identifiers.
col_seq (str, default='sequence') – Column name in df_seq for the sequences.
sep (str, default='|') – Separator used to divide different pieces of information in the FASTA header.
col_db (str, optional) – Column name in df_seq for the database source of the sequence.
cols_info (list of str, optional) – List of column names for additional information to include in the FASTA header.
Notes
The FASTA header for each sequence is composed as follows: >[col_db](optional)|[col_id]|[info1]|…|[infoN] followed by the sequence on the next line.
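The header layout described above can be sketched in plain Python. This is an illustration only, not the aaanalysis implementation: build_fasta_record is a hypothetical helper, and a row is modeled as a plain dict rather than a DataFrame row.

```python
def build_fasta_record(row, col_id="entry", col_seq="sequence", sep="|",
                       col_db=None, cols_info=None):
    """Compose one FASTA record (header line plus sequence line) from a dict-like row.

    Hypothetical helper for illustration; it only mirrors the layout
    >[col_db](optional)|[col_id]|[info1]|...|[infoN] given in the Notes.
    """
    parts = []
    if col_db is not None:
        # Optional database field comes first
        parts.append(str(row[col_db]))
    # The identifier is always present
    parts.append(str(row[col_id]))
    # Additional information fields follow in the order given
    for col in cols_info or []:
        parts.append(str(row[col]))
    header = ">" + sep.join(parts)
    return f"{header}\n{row[col_seq]}\n"


row = {"entry": "AMYLO_1", "sequence": "AAAQAA", "label": 0, "database": "sp"}
record = build_fasta_record(row, col_db="database", cols_info=["label"])
print(record, end="")
# This sketch prints:
# >sp|AMYLO_1|0
# AAAQAA
```

Leaving col_db and cols_info at None reduces the header to the identifier alone, which corresponds to the simplest case of the layout above.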
See also
read_fasta(): the respective FASTA reading function.
Further information and examples on the FASTA format can be found in the BioPerl documentation.
FASTA files can be read with BioPython's SeqIO, which supports various file formats in computational biology.
Examples
You can save a sequence DataFrame (‘df_seq’) as a FASTA file using the to_fasta() function:

    import aaanalysis as aa
    # Load example dataset
    df_seq = aa.load_dataset(name="SEQ_AMYLO", n=10)
    aa.display_df(df_seq)
    # Save as fasta
    file_path = "data/example_FASTA_save.fasta"
    aa.to_fasta(df_seq=df_seq, file_path=file_path)
        entry      sequence  label
    1   AMYLO_1    AAAQAA    0
    2   AMYLO_2    QSSYSS    0
    3   AMYLO_3    QSYGQQ    0
    4   AMYLO_4    QSYNPP    0
    5   AMYLO_5    QSYSGY    0
    6   AMYLO_6    QTDARN    0
    7   AMYLO_7    QTEEKK    0
    8   AMYLO_8    QTFNLF    0
    9   AMYLO_9    QTNLYG    0
    10  AMYLO_10   QVGFGN    0
    11  AMYLO_904  NTVIIE    1
    12  AMYLO_905  NQQNQY    1
    13  AMYLO_906  NQNNFV    1
    14  AMYLO_907  NQIVYK    1
    15  AMYLO_908  NINKSN    1
    16  AMYLO_909  NQFNLM    1
    17  AMYLO_910  NNSGPN    1
    18  AMYLO_911  NNQQNY    1
    19  AMYLO_912  NKGAII    1
    20  AMYLO_913  NNNWSL    1

The names of the entry and sequence columns in the DataFrame should match the col_id and col_seq parameters:

    # Save DataFrame with specified columns
    aa.to_fasta(df_seq=df_seq, file_path=file_path, col_seq="sequence", col_id="entry")
Change the separator of the FASTA headers using sep:

    aa.to_fasta(df_seq=df_seq, file_path=file_path, sep=",")
Specify the database column using the col_db parameter:

    # Add database (Swiss-Prot) to df_seq
    df_seq["database"] = "sp"
    # Save FASTA with database information in header
    aa.to_fasta(df_seq=df_seq, file_path=file_path, col_db="database")
To save additional information in the header, use the cols_info parameter, providing other DataFrame column names:

    aa.to_fasta(df_seq=df_seq, file_path=file_path, cols_info=["label"])
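To check which fields ended up in a header, the header can be split again on the separator. The following is a minimal sketch using only the standard library: parse_fasta_text is a hypothetical helper (read_fasta() is the library's own reader), the example text is hand-written to follow the header layout from the Notes rather than copied from actual to_fasta() output, and the sketch assumes the separator never occurs inside a field.

```python
def parse_fasta_text(text, sep="|"):
    """Split FASTA text into (header_fields, sequence) tuples.

    Hypothetical helper for illustration; assumes `sep` does not
    appear inside any individual header field.
    """
    records = []
    header_fields, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header_fields is not None:
                records.append((header_fields, "".join(seq_lines)))
            header_fields, seq_lines = line[1:].split(sep), []
        elif header_fields is not None:
            # Sequence lines may be wrapped, so collect and join them
            seq_lines.append(line)
    if header_fields is not None:
        records.append((header_fields, "".join(seq_lines)))
    return records


# Hand-written example following the layout >[col_id]|[info1]
example = ">AMYLO_1|0\nAAAQAA\n>AMYLO_2|0\nQSSYSS\n"
records = parse_fasta_text(example)
for fields, seq in records:
    print(fields, seq)
```

Because every header field is recovered by splitting, a separator that also appears in identifiers or information values would produce extra fields, which is one reason to choose sep with the column contents in mind.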