SequencePreprocessor

class SequencePreprocessor

Bases: object

Preprocessing class for representing protein sequences as numeric inputs [Breimann25].

Turns raw amino-acid strings into the array forms a downstream model consumes: a one-hot or integer matrix over a fixed alphabet (via encode_one_hot() / encode_integer(), gap-padded N- or C-terminally), or fixed-length residue windows sliced around a position (via get_aa_window() / get_sliding_aa_window()). Unlike the per-residue dict_num preprocessors (EmbeddingPreprocessor, StructurePreprocessor, AnnotationPreprocessor), it works directly on sequence strings and does not feed CPP.run_num().
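To make the two encoded forms concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. The helper names (`pad_sequences`, `integer_encode_sketch`, `one_hot_encode_sketch`), the 20-letter alphabet, the "-" gap symbol, the "N"/"C" values for `pad_at`, and the choice to give the gap index 0 and its own one-hot column are all assumptions made for illustration. Check the method documentation for the actual defaults and output layout.

```python
# Illustrative sketch only: hypothetical helpers, not the aaanalysis API.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed: the 20 canonical amino acids
GAP = "-"                          # assumed gap symbol


def pad_sequences(list_seq, gap=GAP, pad_at="C"):
    """Pad all sequences with gaps to the length of the longest one."""
    if pad_at not in ("N", "C"):
        raise ValueError("pad_at must be 'N' or 'C'")
    max_len = max(len(seq) for seq in list_seq)
    padded = []
    for seq in list_seq:
        fill = gap * (max_len - len(seq))
        # C-terminal padding appends gaps; N-terminal padding prepends them
        padded.append(seq + fill if pad_at == "C" else fill + seq)
    return padded


def integer_encode_sketch(list_seq, alphabet=ALPHABET, gap=GAP, pad_at="C"):
    """Map each residue to an integer; the gap is assumed to map to 0."""
    index = {gap: 0}
    index.update({aa: i + 1 for i, aa in enumerate(alphabet)})
    return [[index[aa] for aa in seq] for seq in pad_sequences(list_seq, gap, pad_at)]


def one_hot_encode_sketch(list_seq, alphabet=ALPHABET, gap=GAP, pad_at="C"):
    """Flatten each padded sequence into len(seq) * len(symbols) binary features."""
    symbols = gap + alphabet  # assumed: the gap gets its own column
    rows = []
    for seq in pad_sequences(list_seq, gap, pad_at):
        row = []
        for aa in seq:
            row.extend(1 if aa == s else 0 for s in symbols)
        rows.append(row)
    return rows


X_int = integer_encode_sketch(["ACD", "AC"], pad_at="C")
X_hot = one_hot_encode_sketch(["ACD", "AC"], pad_at="N")
```

In this sketch every row has the same length regardless of the input sequence lengths, which is the point of gap padding: the shorter sequence is filled at the chosen terminus so the result is a rectangular matrix.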

Added in version 1.0.0.

Methods

encode_integer(list_seq[, alphabet, gap, pad_at])

Integer-encode a list of protein sequences into a feature matrix.

encode_one_hot(list_seq[, alphabet, gap, pad_at])

One-hot-encode a list of protein sequences into a feature matrix.

get_aa_window(seq[, pos_start, pos_stop, ...])

Extract a window of amino acids from a sequence.

get_sliding_aa_window(seq[, slide_start, ...])

Extract sliding windows of amino acids from a sequence.
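The difference between a single window and sliding windows can be sketched in plain Python. This is an illustration of the concept only: `window_sketch` and `sliding_windows_sketch` are hypothetical helpers, and the 0-based positions, the inclusive stop, the gap filling past the sequence end, and the `window_size` and `step` parameters are assumptions, not the documented behavior of get_aa_window() or get_sliding_aa_window().

```python
# Illustrative sketch only: hypothetical helpers, not the aaanalysis API.
def window_sketch(seq, pos_start=0, pos_stop=None, gap="-"):
    """Return residues pos_start..pos_stop (assumed 0-based, stop inclusive).

    Positions past the end of the sequence are filled with the gap symbol,
    so the window always has length pos_stop - pos_start + 1.
    """
    if pos_stop is None:
        pos_stop = len(seq) - 1
    if pos_start < 0 or pos_stop < pos_start:
        raise ValueError("need 0 <= pos_start <= pos_stop")
    window = seq[pos_start:pos_stop + 1]
    return window + gap * (pos_stop - pos_start + 1 - len(window))


def sliding_windows_sketch(seq, slide_start=0, window_size=5, step=1):
    """Return every full-length window, starting at slide_start."""
    last_start = len(seq) - window_size
    return [seq[i:i + window_size] for i in range(slide_start, last_start + 1, step)]


seq = "MKTAYIAKQR"
single = window_sketch(seq, pos_start=2, pos_stop=5)    # one fixed window
padded = window_sketch(seq, pos_start=8, pos_stop=11)   # runs past the end
windows = sliding_windows_sketch(seq, slide_start=1, window_size=4)
```

A single window yields one fixed-length string around a chosen region, while the sliding variant yields one string per start position, which is the usual input shape for per-residue prediction tasks.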