SequencePreprocessor
- class SequencePreprocessor
Bases: object
Preprocessing class for representing protein sequences as numeric inputs [Breimann25].
Turns raw amino-acid strings into the array forms a downstream model consumes: a one-hot or integer matrix over a fixed alphabet (via encode_one_hot()/encode_integer(), gap-padded N- or C-terminally), or fixed-length residue windows sliced around a position (via get_aa_window()/get_sliding_aa_window()). Unlike the per-residue dict_num preprocessors (EmbeddingPreprocessor, StructurePreprocessor, AnnotationPreprocessor), it works directly on sequence strings and does not feed CPP.run_num().
Added in version 1.0.0.
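To make the two encodings concrete, the following is a minimal NumPy sketch of gap-padded integer and one-hot encoding. It is not the library's implementation: the function names (sketch_encode_integer, sketch_encode_one_hot, pad_sequences), the 20-letter default alphabet, the "-" gap symbol, the choice to give the gap index 0, and the flattening of the one-hot array into a 2D feature matrix are all assumptions made for illustration. Only the parameter names list_seq, alphabet, gap, and pad_at come from the method signatures.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 canonical amino acids
GAP = "-"                          # assumption: gap symbol


def pad_sequences(list_seq, gap=GAP, pad_at="C"):
    """Pad every sequence with the gap symbol to the longest length."""
    if pad_at not in ("N", "C"):
        raise ValueError("pad_at must be 'N' or 'C'")
    max_len = max(len(seq) for seq in list_seq)
    if pad_at == "C":
        # C-terminal padding: gaps are appended at the end
        return [seq.ljust(max_len, gap) for seq in list_seq]
    # N-terminal padding: gaps are prepended at the start
    return [seq.rjust(max_len, gap) for seq in list_seq]


def sketch_encode_integer(list_seq, alphabet=ALPHABET, gap=GAP, pad_at="C"):
    """Return an (n_sequences, max_len) integer matrix; the gap maps to 0."""
    index = {symbol: i for i, symbol in enumerate(gap + alphabet)}
    padded = pad_sequences(list_seq, gap=gap, pad_at=pad_at)
    # A residue outside the alphabet raises KeyError in this sketch
    return np.array([[index[aa] for aa in seq] for seq in padded], dtype=int)


def sketch_encode_one_hot(list_seq, alphabet=ALPHABET, gap=GAP, pad_at="C"):
    """Return an (n_sequences, max_len * n_symbols) binary matrix."""
    X_int = sketch_encode_integer(list_seq, alphabet=alphabet, gap=gap, pad_at=pad_at)
    n_symbols = len(alphabet) + 1  # alphabet plus the gap symbol
    # Row lookup in an identity matrix gives one indicator vector per residue
    one_hot = np.eye(n_symbols, dtype=int)[X_int]
    return one_hot.reshape(len(list_seq), -1)


X_int = sketch_encode_integer(["ACD", "A"], pad_at="C")
X_hot = sketch_encode_one_hot(["ACD", "A"], pad_at="C")
print(X_int.shape, X_hot.shape)
```

In this sketch every position contributes exactly one active column, so each row of the one-hot matrix sums to the padded sequence length. Whether the actual methods include a gap column or return a 2D or 3D array should be checked against their own reference pages.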
Methods
encode_integer(list_seq[, alphabet, gap, pad_at]): Integer-encode a list of protein sequences into a feature matrix.
encode_one_hot(list_seq[, alphabet, gap, pad_at]): One-hot-encode a list of protein sequences into a feature matrix.
get_aa_window(seq[, pos_start, pos_stop, ...]): Extracts a window of amino acids from a sequence.
get_sliding_aa_window(seq[, slide_start, ...]): Extract sliding windows of amino acids from a sequence.
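The window methods can be pictured with plain string slicing. The sketch below is a hypothetical illustration, not the library's code: the function names, the zero-based inclusive positions, the gap padding past the end of the sequence, and the window_size and step parameters are assumptions. Only seq, pos_start, pos_stop, and slide_start appear in the signatures above, and the elided parameters are not reproduced here.

```python
def sketch_aa_window(seq, pos_start=0, pos_stop=None, gap="-"):
    """Return residues pos_start..pos_stop (zero-based, inclusive; assumed).

    If pos_stop runs past the end of the sequence, the window is
    gap-padded on the right so its length stays fixed.
    """
    if pos_stop is None:
        pos_stop = len(seq) - 1
    if pos_start < 0 or pos_stop < pos_start:
        raise ValueError("need 0 <= pos_start <= pos_stop")
    window_size = pos_stop - pos_start + 1
    return seq[pos_start:pos_stop + 1].ljust(window_size, gap)


def sketch_sliding_aa_window(seq, slide_start=0, window_size=5, step=1):
    """Return all full-length windows, starting at slide_start."""
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be positive")
    last_start = len(seq) - window_size
    # The range is empty when the sequence is shorter than one window
    return [seq[i:i + window_size] for i in range(slide_start, last_start + 1, step)]


print(sketch_aa_window("ACDEFGHIK", pos_start=2, pos_stop=5))
print(sketch_sliding_aa_window("ACDEFG", window_size=4))
```

The actual methods may index positions differently (for example, one-based) or handle short sequences another way, so their reference pages remain the authority on indexing and padding.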