aaanalysis.SequencePreprocessor.encode_integer

static SequencePreprocessor.encode_integer(list_seq=None, alphabet='ACDEFGHIKLMNPQRSTVWY', gap='-', pad_at='C')[source]

Integer-encode a list of protein sequences into a feature matrix.

Each amino acid is represented by an integer between 1 and n, where n is the number of characters in the alphabet. Gaps are represented by 0. Shorter sequences are padded with gaps at either the N- or the C-terminus.
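The mapping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `encode_integer_sketch` is hypothetical, and it returns nested lists rather than an array.

```python
def encode_integer_sketch(list_seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-", pad_at="C"):
    """Hypothetical helper illustrating the documented mapping (not the aaanalysis code)."""
    # Map each alphabet character to 1..n; the gap character maps to 0
    char_to_int = {char: i for i, char in enumerate(alphabet, start=1)}
    char_to_int[gap] = 0
    max_len = max(len(seq) for seq in list_seq)
    rows = []
    for seq in list_seq:
        padding = gap * (max_len - len(seq))
        # Pad at the beginning (N-terminus) or at the end (C-terminus)
        padded = padding + seq if pad_at == "N" else seq + padding
        rows.append([char_to_int[char] for char in padded])
    return rows


rows = encode_integer_sketch(["ABC", "B"], alphabet="ABC")
print(rows)  # [[1, 2, 3], [2, 0, 0]]
```

With `pad_at="N"`, the same call would place the zeros before the encoded residues instead of after them.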

Parameters:
  • list_seq (list of str or str) – List of protein sequences to encode. Every character in each sequence must either be part of the alphabet or be the gap character.

  • alphabet (str, default='ACDEFGHIKLMNPQRSTVWY') – The alphabet of amino acids used for encoding.

  • gap (str, default='-') – The character used to represent gaps within sequences. It should not be included in the alphabet.

  • pad_at (str, default='C') –

    Specifies where to add the padding:

    • 'N' for N-terminus (beginning of the sequence),

    • 'C' for C-terminus (end of the sequence).
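Because every character must belong to the alphabet or be the gap character, it can help to check sequences before encoding. The following is a minimal sketch; `find_invalid_chars` is a hypothetical helper and not part of aaanalysis, and how the library itself reacts to invalid characters is not stated here.

```python
def find_invalid_chars(list_seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-"):
    """Hypothetical pre-check: return characters covered by neither alphabet nor gap."""
    allowed = set(alphabet) | {gap}
    # Collect every offending character once, in sorted order
    return sorted({char for seq in list_seq for char in seq if char not in allowed})


invalid = find_invalid_chars(["ACD-", "AXZ"])
print(invalid)  # ['X', 'Z']
```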

Returns:

  • X (array-like, shape (n_samples, n_residues)) – Feature matrix containing the integer-encoded, position-wise representation of residues.

  • features (list of str) – List of feature names corresponding to each position in the encoded matrix.
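Judging from the example outputs on this page, the feature names follow a P1 to Pn pattern, one per residue position. A minimal sketch of that naming, assuming the pattern holds in general (the variable names are illustrative only):

```python
# Assumption: one feature name per position, numbered from 1
n_residues = 3
features = [f"P{i}" for i in range(1, n_residues + 1)]
print(features)  # ['P1', 'P2', 'P3']
```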

Examples

To demonstrate integer encoding of protein sequences using the SequencePreprocessor().encode_integer() method, we first create a list of example sequences:

import aaanalysis as aa
import pandas as pd

list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
sp = aa.SequencePreprocessor()

Provide the sequences via the list_seq parameter to obtain a feature matrix (X) and the corresponding feature names. Each entry of X is the integer representation of the amino acid at the given residue position:

X, features = sp.encode_integer(list_seq=list_seq)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)
DataFrame shape: (2, 10)
  P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
AACDEFGHIY 1 1 2 3 4 5 6 7 8 20
IIHGFECDAY 8 8 7 6 5 4 2 3 1 20

You can adjust the used alphabet to change the considered characters:

# Show integer encoding with smaller alphabet
list_seq = ["ABC", "CBA"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)
DataFrame shape: (2, 3)
  P1 P2 P3
ABC 1 2 3
CBA 3 2 1

Change the gap symbol (default='-') as follows:

# Show integer encoding with other gap ('*')
list_seq = ["ABC", "CB*"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, gap="*")

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)
DataFrame shape: (2, 3)
  P1 P2 P3
ABC 1 2 3
CB* 3 2 0

If one sequence is shorter than the other, it is padded with gaps either at the N-terminus or at the C-terminus (default). Select the padding side using the pad_at parameter ('N' or 'C'):

# Show default padding (at C-terminus)
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)
DataFrame shape: (2, 3)
  P1 P2 P3
ABC 1 2 3
B 2 0 0

# Show N-terminal padding
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)
DataFrame shape: (2, 3)
  P1 P2 P3
ABC 1 2 3
B 0 0 2