SequencePreprocessor.encode_integer

static SequencePreprocessor.encode_integer(list_seq, alphabet='ACDEFGHIKLMNPQRSTVWY', gap='-', pad_at='C')

Integer-encode a list of protein sequences into a feature matrix.

Each amino acid is represented by an integer between 1 and n, where n is the number of characters in the alphabet. Gaps are represented by 0. Shorter sequences are padded with gaps at either the N- or C-terminus.

Added in version 0.1.0.

Parameters:

list_seq (list of str or str) – List of protein sequences to encode. All characters in each sequence must be part of the alphabet or be the gap character.
alphabet (str, default='ACDEFGHIKLMNPQRSTVWY') – The alphabet of amino acids used for encoding.
gap (str, default='-') – The character used to represent gaps within sequences. It should not be included in the alphabet.
pad_at (str, default='C') –
Specifies where to add the padding:
- 'N' for N-terminus (beginning of the sequence),
- 'C' for C-terminus (end of the sequence).

Returns:

X (array-like, shape (n_samples, n_residues)) – Feature matrix containing the integer-encoded, position-wise representation of residues.
features (list of str) – List of feature names corresponding to each position in the encoded matrix.
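
The encoding rule described above can be sketched in plain Python. This is a hypothetical re-implementation for illustration, not the library's source: the function name encode_integer_sketch is an assumption, it returns nested lists rather than an array, and the P1..Pn feature names are inferred from the example output shown on this page.

```python
def encode_integer_sketch(list_seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-", pad_at="C"):
    """Hypothetical sketch of the documented behavior (not the aaanalysis implementation)."""
    # Map each alphabet character to 1..n; the gap character maps to 0
    char_to_int = {char: i for i, char in enumerate(alphabet, start=1)}
    char_to_int[gap] = 0
    max_len = max(len(seq) for seq in list_seq)
    X = []
    for seq in list_seq:
        # Pad shorter sequences with the gap character at the chosen terminus
        padded = seq.ljust(max_len, gap) if pad_at == "C" else seq.rjust(max_len, gap)
        X.append([char_to_int[char] for char in padded])
    # Assumed naming scheme: one feature per residue position, P1..Pn
    features = [f"P{i}" for i in range(1, max_len + 1)]
    return X, features


X, features = encode_integer_sketch(["ACD", "Y"])
print(X)         # [[1, 2, 3], [20, 0, 0]]
print(features)  # ['P1', 'P2', 'P3']
```

The sketch assumes a single-character gap symbol, since str.ljust and str.rjust accept only a one-character fill.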

Examples

To demonstrate integer encoding of protein sequences using the SequencePreprocessor().encode_integer() method, we first create two example sequences:

import aaanalysis as aa
import pandas as pd

list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
seqp = aa.SequencePreprocessor()

Pass the sequences via the list_seq parameter to obtain a feature matrix (X) of integer-encoded amino acids and the corresponding feature names, one per residue position:

X, features = seqp.encode_integer(list_seq=list_seq)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 10)

	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10
AACDEFGHIY	1	1	2	3	4	5	6	7	8	20
IIHGFECDAY	8	8	7	6	5	4	2	3	1	20
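
Because the mapping is one-to-one, an encoded matrix can be translated back into sequences. The helper below is a minimal sketch under that assumption; decode_integer_sketch is a hypothetical name and is not part of the documented API.

```python
def decode_integer_sketch(X, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-"):
    """Hypothetical inverse of the integer encoding (illustration only)."""
    # Index 0 holds the gap; indices 1..n hold the alphabet characters
    int_to_char = gap + alphabet
    return ["".join(int_to_char[value] for value in row) for row in X]


# Rows copied from the encoded example table above
X = [
    [1, 1, 2, 3, 4, 5, 6, 7, 8, 20],
    [8, 8, 7, 6, 5, 4, 2, 3, 1, 20],
]
print(decode_integer_sketch(X))  # ['AACDEFGHIY', 'IIHGFECDAY']
```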

Adjust the alphabet to change which characters are encoded:

# Show integer encoding with smaller alphabet
list_seq = ["ABC", "CBA"]
ALPHABET = "ABC"
X, features = seqp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

	P1	P2	P3
ABC	1	2	3
CBA	3	2	1

Change the gap symbol (default='-') as follows:

# Show integer encoding with other gap ('*')
list_seq = ["ABC", "CB*"]
ALPHABET = "ABC"
X, features = seqp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, gap="*")

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

	P1	P2	P3
ABC	1	2	3
CB*	3	2	0
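
Since every character must belong to the alphabet or be the gap symbol, it can help to check sequences before encoding them. The following is a minimal sketch of such a check; find_invalid_characters is a hypothetical helper, and how the library itself reports invalid input is not shown here.

```python
def find_invalid_characters(list_seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-"):
    """Hypothetical pre-check: report characters that are neither in the alphabet nor the gap."""
    # The gap symbol should not be part of the alphabet
    if gap in alphabet:
        raise ValueError(f"gap ('{gap}') should not be included in the alphabet")
    allowed = set(alphabet) | {gap}
    # Keep only sequences that contain at least one disallowed character
    return {seq: sorted(set(seq) - allowed) for seq in list_seq if set(seq) - allowed}


print(find_invalid_characters(["ABC", "CB*"], alphabet="ABC", gap="*"))  # {}
print(find_invalid_characters(["ABC", "CB*"], alphabet="ABC"))           # {'CB*': ['*']}
```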

If one sequence is shorter than the other, gaps are added at either the N-terminus or the C-terminus (default); this is called padding. Set where the padding is added with the pad_at parameter ('N' or 'C'):

# Show default padding (at C-terminus)
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = seqp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

	P1	P2	P3
ABC	1	2	3
B	2	0	0

# Show N-terminal padding
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = seqp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

	P1	P2	P3
ABC	1	2	3
B	0	0	2
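
Padding positions and gaps in the input are both encoded as 0, so the encoded matrix alone cannot tell them apart. The sketch below, which uses a hypothetical helper name and plain nested lists, counts the non-zero entries per row; the count is the same for both padding modes.

```python
def count_encoded_residues(X):
    """Hypothetical helper: count non-zero (non-gap, non-padding) positions per row."""
    return [sum(1 for value in row if value != 0) for row in X]


# Rows copied from the two padding examples above
X_pad_c = [[1, 2, 3], [2, 0, 0]]  # C-terminal padding
X_pad_n = [[1, 2, 3], [0, 0, 2]]  # N-terminal padding
print(count_encoded_residues(X_pad_c))  # [3, 1]
print(count_encoded_residues(X_pad_n))  # [3, 1]
```

If the input sequences already contain gap characters, those positions are excluded from the count as well.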