aaanalysis.SequencePreprocessor.encode_one_hot
- static SequencePreprocessor.encode_one_hot(list_seq=None, alphabet='ACDEFGHIKLMNPQRSTVWY', gap='-', pad_at='C')
One-hot-encode a list of protein sequences into a feature matrix.
Each residue is represented by a binary vector of length equal to the alphabet size. For each sequence position, the amino acid is set to 1 in its corresponding position in the vector, while all other positions are set to 0. Gaps are represented by zero vectors. Shorter sequences are padded with gaps either N- or C-terminally.
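The description above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name `encode_sequence` is hypothetical, and the sketch only assumes what the paragraph states (one binary vector per residue, a zero vector for gaps).

```python
def encode_sequence(seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-"):
    """Hypothetical helper: flatten one sequence into a one-hot vector."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = []
    for residue in seq:
        vector = [0] * len(alphabet)  # a gap keeps this all-zero vector
        if residue != gap:
            # A KeyError here means the residue is neither in the alphabet nor the gap
            vector[index[residue]] = 1
        encoded.extend(vector)
    return encoded


row = encode_sequence("CB*", alphabet="ABC", gap="*")
print(row)  # [0, 0, 1, 0, 1, 0, 0, 0, 0]
```

Each residue contributes `len(alphabet)` entries, so a sequence of three residues over a three-letter alphabet yields nine values.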
- Parameters:
list_seq (list of str or str) – List of protein sequences to encode. All characters in each sequence must be part of the alphabet or be represented by the gap.
alphabet (str, default='ACDEFGHIKLMNPQRSTVWY') – The alphabet of amino acids used for encoding.
gap (str, default='-') – The character used to represent gaps within sequences. It should not be included in the alphabet.
pad_at (str, default='C') – Specifies where to add the padding: 'N' for N-terminus (beginning of the sequence), 'C' for C-terminus (end of the sequence).
- Returns:
X (array-like, shape (n_samples, n_residues*n_characters)) – Feature matrix containing one-hot encoded position-wise representation of residues.
features (list of str) – List of feature names corresponding to each position and amino acid in the encoded matrix.
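As a rough illustration of how the matrix width and the feature names relate, the sketch below builds position-major names and decodes a flattened row back into a sequence. The naming pattern (position followed by character, such as '1A') is an assumption inferred from the column headers in the example output, and both helper names are hypothetical rather than part of aaanalysis.

```python
def build_feature_names(n_residues, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Hypothetical helper: names such as '1A', '1C', ..., '10Y'."""
    return [f"{pos}{char}" for pos in range(1, n_residues + 1) for char in alphabet]


def decode_row(row, alphabet, gap="-"):
    """Hypothetical helper: map a flattened one-hot row back to a sequence."""
    n_chars = len(alphabet)
    # Split the row into one chunk of n_chars values per residue position
    chunks = [row[i:i + n_chars] for i in range(0, len(row), n_chars)]
    return "".join(alphabet[c.index(1)] if 1 in c else gap for c in chunks)


features = build_feature_names(n_residues=10)
print(len(features))                # 200 (10 residues * 20 characters)
print(features[:3], features[-1])   # ['1A', '1C', '1D'] 10Y
print(decode_row([0, 0, 1, 0, 1, 0, 0, 0, 0], alphabet="ABC", gap="*"))  # CB*
```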
Examples
To demonstrate one-hot encoding of protein sequences using the SequencePreprocessor().encode_one_hot() method, we first create two example sequences:

    import aaanalysis as aa
    import pandas as pd

    list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
    sp = aa.SequencePreprocessor()

Provide the sequences as the list_seq parameter to obtain a feature matrix (X) and the respective features, which are binary representations of each amino acid at the given residue positions:

    X, features = sp.encode_one_hot(list_seq=list_seq)
    # Convert to DataFrame for visualization
    df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 200)
            1A  1C  1D  1E  1F  1G  1H  1I  1K  1L  1M  1N  1P  1Q  1R  1S  1T  1V  1W  1Y  2A  2C  2D  2E  2F  2G  2H  2I  2K  2L  2M  2N  2P  2Q  2R  2S  2T  2V  2W  2Y  3A  3C  3D  3E  3F  3G  3H  3I  3K  3L  3M  3N  3P  3Q  3R  3S  3T  3V  3W  3Y  4A  4C  4D  4E  4F  4G  4H  4I  4K  4L  4M  4N  4P  4Q  4R  4S  4T  4V  4W  4Y  5A  5C  5D  5E  5F  5G  5H  5I  5K  5L  5M  5N  5P  5Q  5R  5S  5T  5V  5W  5Y  6A  6C  6D  6E  6F  6G  6H  6I  6K  6L  6M  6N  6P  6Q  6R  6S  6T  6V  6W  6Y  7A  7C  7D  7E  7F  7G  7H  7I  7K  7L  7M  7N  7P  7Q  7R  7S  7T  7V  7W  7Y  8A  8C  8D  8E  8F  8G  8H  8I  8K  8L  8M  8N  8P  8Q  8R  8S  8T  8V  8W  8Y  9A  9C  9D  9E  9F  9G  9H  9I  9K  9L  9M  9N  9P  9Q  9R  9S  9T  9V  9W  9Y 10A 10C 10D 10E 10F 10G 10H 10I 10K 10L 10M 10N 10P 10Q 10R 10S 10T 10V 10W 10Y
AACDEFGHIY   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
IIHGFECDAY   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1

You can adjust the alphabet to change which characters are considered:

    # Show one-hot encoding with smaller alphabet
    list_seq = ["ABC", "CBA"]
    ALPHABET = "ABC"
    X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET)
    # Convert to DataFrame for visualization
    df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)
     1A  1B  1C  2A  2B  2C  3A  3B  3C
ABC   1   0   0   0   1   0   0   0   1
CBA   0   0   1   0   1   0   1   0   0

Change the gap symbol (default='-') as follows:

    # Show one-hot encoding with other gap ('*')
    list_seq = ["ABC", "CB*"]
    ALPHABET = "ABC"
    X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET, gap="*")
    # Convert to DataFrame for visualization
    df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)
     1A  1B  1C  2A  2B  2C  3A  3B  3C
ABC   1   0   0   0   1   0   0   0   1
CB*   0   0   1   0   1   0   0   0   0

If one sequence is shorter than the others, gaps are added at either the N-terminus or the C-terminus (default); this is called padding. Adjust the padding using the pad_at ('N' or 'C') parameter:

    # Show default padding (at C-terminus)
    list_seq = ["ABC", "B"]
    ALPHABET = "ABC"
    X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET)
    # Convert to DataFrame for visualization
    df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)
     1A  1B  1C  2A  2B  2C  3A  3B  3C
ABC   1   0   0   0   1   0   0   0   1
B     0   1   0   0   0   0   0   0   0

    # Show N-terminal padding
    list_seq = ["ABC", "B"]
    ALPHABET = "ABC"
    X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")
    # Convert to DataFrame for visualization
    df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)
     1A  1B  1C  2A  2B  2C  3A  3B  3C
ABC   1   0   0   0   1   0   0   0   1
B     0   0   0   0   0   0   0   1   0
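
To make the two padding modes concrete, the sketch below pads the raw sequences with the gap character before any encoding takes place. It is an illustration only, not the library's code: `pad_sequences` is a hypothetical helper, and it assumes the gap is a single character.

```python
def pad_sequences(list_seq, gap="-", pad_at="C"):
    """Hypothetical helper: pad all sequences to the length of the longest one."""
    if pad_at not in ("N", "C"):
        raise ValueError("pad_at should be 'N' or 'C'")
    max_len = max(len(seq) for seq in list_seq)
    if pad_at == "C":
        # C-terminal padding: gaps are appended at the end
        return [seq.ljust(max_len, gap) for seq in list_seq]
    # N-terminal padding: gaps are prepended at the beginning
    return [seq.rjust(max_len, gap) for seq in list_seq]


print(pad_sequences(["ABC", "B"], pad_at="C"))  # ['ABC', 'B--']
print(pad_sequences(["ABC", "B"], pad_at="N"))  # ['ABC', '--B']
```

This is consistent with the two tables above: with C-terminal padding the residue B of the shorter sequence sets column 1B, whereas with N-terminal padding it sets column 3B.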