aaanalysis.SequencePreprocessor.encode_integer
- static SequencePreprocessor.encode_integer(list_seq=None, alphabet='ACDEFGHIKLMNPQRSTVWY', gap='-', pad_at='C')
Integer-encode a list of protein sequences into a feature matrix.
Each amino acid is represented by an integer between 1 and n, where n is the number of characters in the alphabet. Gaps are represented by 0. Sequences shorter than the longest one are padded with gaps at either the N- or the C-terminus.
- Parameters:
list_seq (list of str or str) – List of protein sequences to encode. All characters in each sequence must be part of the alphabet or be represented by the gap.
alphabet (str, default='ACDEFGHIKLMNPQRSTVWY') – The alphabet of amino acids used for encoding.
gap (str, default='-') – The character used to represent gaps within sequences. It should not be included in the alphabet.
pad_at (str, default='C') – Specifies where to add the padding:
    'N' for N-terminus (beginning of the sequence),
    'C' for C-terminus (end of the sequence).
- Returns:
X (array-like, shape (n_samples, n_residues)) – Feature matrix containing the integer-encoded, position-wise representation of residues.
features (list of str) – List of feature names corresponding to each position in the encoded matrix.
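The encoding and padding rules described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the aaanalysis source code: the function name encode_integer_sketch is hypothetical, the gap is assumed to be a single character, and the result is returned as nested lists rather than an array.

```python
def encode_integer_sketch(list_seq, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-", pad_at="C"):
    """Illustrative sketch of integer encoding; NOT the aaanalysis implementation."""
    if isinstance(list_seq, str):
        list_seq = [list_seq]
    if not list_seq:
        raise ValueError("'list_seq' must contain at least one sequence")
    if len(gap) != 1 or gap in alphabet:
        raise ValueError("'gap' must be a single character that is not in 'alphabet'")
    if pad_at not in ("N", "C"):
        raise ValueError("'pad_at' must be 'N' or 'C'")

    # Alphabet characters map to 1..n (in alphabet order); the gap maps to 0
    char_to_int = {char: i for i, char in enumerate(alphabet, start=1)}
    char_to_int[gap] = 0

    max_len = max(len(seq) for seq in list_seq)
    X = []
    for seq in list_seq:
        unknown = set(seq) - set(char_to_int)
        if unknown:
            raise ValueError(f"Characters not in alphabet or gap: {sorted(unknown)}")
        # Pad shorter sequences with gaps at the chosen terminus
        n_pad = max_len - len(seq)
        padded = gap * n_pad + seq if pad_at == "N" else seq + gap * n_pad
        X.append([char_to_int[char] for char in padded])

    # One feature name per residue position: P1, P2, ...
    features = [f"P{i}" for i in range(1, max_len + 1)]
    return X, features
```

Raising an error for characters outside the alphabet mirrors the requirement stated for list_seq; how the library itself reports such input is not shown here.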
Examples
To demonstrate integer encoding of protein sequences using the SequencePreprocessor().encode_integer() method, we first create example sequences:

    import aaanalysis as aa
    import pandas as pd

    list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
    sp = aa.SequencePreprocessor()
Provide the sequences as the list_seq parameter to obtain a feature matrix (X) and the respective features, which are the integer amino acid representations at the given residue positions:

    X, features = sp.encode_integer(list_seq=list_seq)
    # Convert to DataFrame for visualization
    df_encode = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 10)

            P1  P2  P3  P4  P5  P6  P7  P8  P9  P10
AACDEFGHIY   1   1   2   3   4   5   6   7   8   20
IIHGFECDAY   8   8   7   6   5   4   2   3   1   20

You can adjust the alphabet to change the characters that are considered:

    # Show integer encoding with smaller alphabet
    list_seq = ["ABC", "CBA"]
    ALPHABET = "ABC"
    X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)
    # Convert to DataFrame for visualization
    df_encode = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

     P1  P2  P3
ABC   1   2   3
CBA   3   2   1

Change the gap symbol (default='-') as follows:

    # Show integer encoding with other gap ('*')
    list_seq = ["ABC", "CB*"]
    ALPHABET = "ABC"
    X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, gap="*")
    # Convert to DataFrame for visualization
    df_encode = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

     P1  P2  P3
ABC   1   2   3
CB*   3   2   0

If one sequence is shorter than the other, gaps are added at either the N-terminus or the C-terminus (default); this is called padding. Adjust the padding using the pad_at ('N' or 'C') parameter:

    # Show default padding (at C-terminus)
    list_seq = ["ABC", "B"]
    ALPHABET = "ABC"
    X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)
    # Convert to DataFrame for visualization
    df_encode = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

     P1  P2  P3
ABC   1   2   3
B     2   0   0

    # Show N-terminal padding
    list_seq = ["ABC", "B"]
    ALPHABET = "ABC"
    X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")
    # Convert to DataFrame for visualization
    df_encode = pd.DataFrame(X, columns=features, index=list_seq)
    aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)

     P1  P2  P3
ABC   1   2   3
B     0   0   2
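Because each integer corresponds to one alphabet position and 0 to the gap, an encoded matrix can be mapped back to padded sequences as a quick sanity check. The helper below is a minimal sketch: decode_integer_sketch is a hypothetical name and not part of aaanalysis, and it assumes X is a nested list or array of integers produced with the same alphabet and gap.

```python
def decode_integer_sketch(X, alphabet="ACDEFGHIKLMNPQRSTVWY", gap="-"):
    """Map an integer-encoded matrix back to (padded) sequences. Illustrative only."""
    # Invert the encoding: 1..n -> alphabet characters, 0 -> gap
    int_to_char = {i: char for i, char in enumerate(alphabet, start=1)}
    int_to_char[0] = gap
    return ["".join(int_to_char[int(value)] for value in row) for row in X]


# Rows taken from the padding examples above (alphabet "ABC")
decoded_c = decode_integer_sketch([[1, 2, 3], [2, 0, 0]], alphabet="ABC")
decoded_n = decode_integer_sketch([[1, 2, 3], [0, 0, 2]], alphabet="ABC")
print(decoded_c)
print(decoded_n)
```

The decoded strings keep the padding gaps, so the side on which a short sequence was padded stays visible.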