SequencePreprocessor.encode_one_hot

static SequencePreprocessor.encode_one_hot(list_seq, alphabet='ACDEFGHIKLMNPQRSTVWY', gap='-', pad_at='C')

One-hot-encode a list of protein sequences into a feature matrix.

Each residue is represented by a binary vector whose length equals the alphabet size. At each sequence position, the vector element corresponding to the amino acid is set to 1 and all other elements are set to 0. Gaps are represented by zero vectors. Shorter sequences are padded with gaps at either the N-terminus or the C-terminus.
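The encoding described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the aaanalysis source code: the function name one_hot_sketch is hypothetical, a single string is assumed to be treated as a one-element list, and the gap is assumed to be a single character.

```python
def one_hot_sketch(list_seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-", pad_at="C"):
    """Illustrative one-hot encoder (hypothetical helper, not the library code)."""
    # Assumption: a single string is handled as a list with one sequence
    if isinstance(list_seq, str):
        list_seq = [list_seq]
    max_len = max(len(seq) for seq in list_seq)
    index = {aa: i for i, aa in enumerate(alphabet)}
    X = []
    for seq in list_seq:
        # Pad shorter sequences with gaps at the chosen terminus
        padding = gap * (max_len - len(seq))
        padded = seq + padding if pad_at == "C" else padding + seq
        row = []
        for residue in padded:
            vec = [0] * len(alphabet)
            if residue != gap:
                vec[index[residue]] = 1  # gaps stay all-zero
            row.extend(vec)
        X.append(row)
    # Feature names combine the 1-based position and the alphabet character
    features = [f"{pos}{aa}" for pos in range(1, max_len + 1) for aa in alphabet]
    return X, features


X, features = one_hot_sketch(["ABC", "B"], alphabet="ABC")
print(features)  # ['1A', '1B', '1C', '2A', '2B', '2C', '3A', '3B', '3C']
print(X[0])      # [1, 0, 0, 0, 1, 0, 0, 0, 1]
print(X[1])      # [0, 1, 0, 0, 0, 0, 0, 0, 0]
```

The sketch returns nested lists rather than an array to stay dependency-free; the flattened row layout (position-major, then alphabet order) follows the feature names shown in the examples on this page.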

Added in version 0.1.0.

Parameters:

list_seq (list of str or str) – List of protein sequences to encode. All characters in each sequence must be part of the alphabet or be the gap character.
alphabet (str, default='ACDEFGHIKLMNPQRSTVWY') – The alphabet of amino acids used for encoding.
gap (str, default='-') – The character used to represent gaps within sequences. It should not be included in the alphabet.
pad_at (str, default='C') –
Specifies where to add the padding:
- 'N' for N-terminus (beginning of the sequence),
- 'C' for C-terminus (end of the sequence).
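The parameter constraints and the two padding modes can be illustrated with a short sketch. The helper names check_inputs and pad_sequences are hypothetical, and raising ValueError is an assumption for illustration; the library's actual validation and error messages may differ.

```python
def check_inputs(list_seq, alphabet, gap):
    """Hypothetical validator mirroring the documented parameter constraints."""
    if gap in alphabet:
        raise ValueError(f"gap '{gap}' should not be part of the alphabet")
    allowed = set(alphabet) | {gap}
    for seq in list_seq:
        invalid = set(seq) - allowed
        if invalid:
            raise ValueError(f"'{seq}' contains invalid characters: {sorted(invalid)}")


def pad_sequences(list_seq, gap="-", pad_at="C"):
    """Hypothetical helper: pad all sequences to the length of the longest one."""
    if pad_at not in ("N", "C"):
        raise ValueError("pad_at should be 'N' or 'C'")
    max_len = max(len(seq) for seq in list_seq)
    if pad_at == "C":
        # C-terminal padding appends gaps at the end of the sequence
        return [seq.ljust(max_len, gap) for seq in list_seq]
    # N-terminal padding prepends gaps at the beginning of the sequence
    return [seq.rjust(max_len, gap) for seq in list_seq]


check_inputs(["ABC", "B"], alphabet="ABC", gap="-")
print(pad_sequences(["ABC", "B"]))              # ['ABC', 'B--']
print(pad_sequences(["ABC", "B"], pad_at="N"))  # ['ABC', '--B']
```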

Returns:

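X (array-like, shape (n_samples, n_residues*n_characters)) – Feature matrix containing the one-hot encoded, position-wise representation of residues, where n_residues is the length of the longest sequence and n_characters is the size of the alphabet.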
features (list of str) – List of feature names corresponding to each position and amino acid in the encoded matrix.
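Judging from the feature names shown in the examples on this page (such as 1A, 1C, ..., 10Y), each name combines a 1-based position with an alphabet character, and columns are ordered by position first and alphabet order second. Under that assumption, a feature name can be mapped to its column index as follows; feature_to_column is a hypothetical helper, not part of aaanalysis.

```python
def feature_to_column(feature, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Map a feature name such as '10Y' to its 0-based column index."""
    # The last character is the residue; everything before it is the position
    position, residue = int(feature[:-1]), feature[-1]
    return (position - 1) * len(alphabet) + alphabet.index(residue)


print(feature_to_column("1A"))                  # 0
print(feature_to_column("10Y"))                 # 199
print(feature_to_column("3B", alphabet="ABC"))  # 7
```

This parsing assumes single-character alphabet symbols, which holds for the default amino acid alphabet.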

Examples

To demonstrate one-hot encoding of protein sequences using the SequencePreprocessor().encode_one_hot() method, we first create an example list of sequences:

import aaanalysis as aa
import pandas as pd

list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
seqp = aa.SequencePreprocessor()

Provide the sequences via the list_seq parameter to obtain a feature matrix (X) and the corresponding feature names (features). Each feature is a binary indicator for one amino acid at one residue position:

X, features = seqp.encode_one_hot(list_seq=list_seq)

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 200)

	1A	1C	1D	1E	1F	1G	1H	1I	1K	1L	1M	1N	1P	1Q	1R	1S	1T	1V	1W	1Y	2A	2C	2D	2E	2F	2G	2H	2I	2K	2L	2M	2N	2P	2Q	2R	2S	2T	2V	2W	2Y	3A	3C	3D	3E	3F	3G	3H	3I	3K	3L	3M	3N	3P	3Q	3R	3S	3T	3V	3W	3Y	4A	4C	4D	4E	4F	4G	4H	4I	4K	4L	4M	4N	4P	4Q	4R	4S	4T	4V	4W	4Y	5A	5C	5D	5E	5F	5G	5H	5I	5K	5L	5M	5N	5P	5Q	5R	5S	5T	5V	5W	5Y	6A	6C	6D	6E	6F	6G	6H	6I	6K	6L	6M	6N	6P	6Q	6R	6S	6T	6V	6W	6Y	7A	7C	7D	7E	7F	7G	7H	7I	7K	7L	7M	7N	7P	7Q	7R	7S	7T	7V	7W	7Y	8A	8C	8D	8E	8F	8G	8H	8I	8K	8L	8M	8N	8P	8Q	8R	8S	8T	8V	8W	8Y	9A	9C	9D	9E	9F	9G	9H	9I	9K	9L	9M	9N	9P	9Q	9R	9S	9T	9V	9W	9Y	10A	10C	10D	10E	10F	10G	10H	10I	10K	10L	10M	10N	10P	10Q	10R	10S	10T	10V	10W	10Y
AACDEFGHIY	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
IIHGFECDAY	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1

Adjust the alphabet parameter to change which characters are encoded:

# Show one-hot encoding with smaller alphabet
list_seq = ["ABC", "CBA"]
ALPHABET = "ABC"
X, features = seqp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)

	1A	1B	1C	2A	2B	2C	3A	3B	3C
ABC	1	0	0	0	1	0	0	0	1
CBA	0	0	1	0	1	0	1	0	0

Change the gap symbol (default='-') using the gap parameter:

# Show one-hot encoding with other gap ('*')
list_seq = ["ABC", "CB*"]
ALPHABET = "ABC"
X, features = seqp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET, gap="*")

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)

	1A	1B	1C	2A	2B	2C	3A	3B	3C
ABC	1	0	0	0	1	0	0	0	1
CB*	0	0	1	0	1	0	0	0	0
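Because a gap is encoded as an all-zero block, a row can be decoded back into its sequence by reading one alphabet-sized block per position. The sketch below uses the two rows displayed above; decode_row is a hypothetical helper for illustration and not part of aaanalysis.

```python
def decode_row(row, alphabet, gap="-"):
    """Rebuild a sequence from a flattened one-hot row (illustrative helper)."""
    n = len(alphabet)
    residues = []
    for start in range(0, len(row), n):
        block = row[start:start + n]
        # A block without any 1 corresponds to a gap position
        residues.append(alphabet[block.index(1)] if 1 in block else gap)
    return "".join(residues)


print(decode_row([1, 0, 0, 0, 1, 0, 0, 0, 1], "ABC", gap="*"))  # ABC
print(decode_row([0, 0, 1, 0, 1, 0, 0, 0, 0], "ABC", gap="*"))  # CB*
```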

If sequences differ in length, the shorter ones are padded with gaps at either the N-terminus or the C-terminus (default). Select the padding side using the pad_at parameter ('N' or 'C'):

# Show default padding (at C-terminus)
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = seqp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)

	1A	1B	1C	2A	2B	2C	3A	3B	3C
ABC	1	0	0	0	1	0	0	0	1
B	0	1	0	0	0	0	0	0	0

# Show N-terminal padding
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = seqp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)

	1A	1B	1C	2A	2B	2C	3A	3B	3C
ABC	1	0	0	0	1	0	0	0	1
B	0	0	0	0	0	0	0	1	0
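To see the effect of the padding side at a glance, it can help to list only the features that are set to 1 for a given row. The sketch below copies the rows for the sequence "B" from the two tables above; active_features is a hypothetical helper, not part of aaanalysis.

```python
def active_features(row, features):
    """Return the names of all features set to 1 in a one-hot row."""
    return [name for name, value in zip(features, row) if value == 1]


features = ["1A", "1B", "1C", "2A", "2B", "2C", "3A", "3B", "3C"]
row_pad_c = [0, 1, 0, 0, 0, 0, 0, 0, 0]  # "B" with pad_at="C"
row_pad_n = [0, 0, 0, 0, 0, 0, 0, 1, 0]  # "B" with pad_at="N"

print(active_features(row_pad_c, features))  # ['1B']
print(active_features(row_pad_n, features))  # ['3B']
```

With C-terminal padding the residue stays at position 1, while N-terminal padding shifts it to the last position, so the same sequence activates a different feature.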