aaanalysis.load_dataset
- aaanalysis.load_dataset(name='Overview', n=None, random=False, non_canonical_aa='remove', min_len=None, max_len=None, aa_window_size=9)
Load protein benchmarking datasets.
The benchmarks are categorized into amino acid (‘AA’), domain (‘DOM’), and sequence (‘SEQ’) level datasets. By default, an overview table is provided (name='Overview'). For in-depth details, refer to [Breimann24a].
Added in version 0.1.0.
- Parameters:
  - name (str, default='Overview') – The dataset to load, given as its ‘Dataset’ name (see Notes for the full list grouped by level). The default 'Overview' returns the benchmark overview table instead of a single dataset.
  - n (int, optional) – Number of proteins per class, selected by index. If None, the whole dataset will be returned.
  - random (bool, default=False) – If True, n randomly selected proteins per class will be chosen.
  - non_canonical_aa ({'remove', 'keep', 'gap'}, default='remove') – Options for handling non-canonical amino acids:
    - remove: Remove sequences containing non-canonical amino acids.
    - keep: Don’t remove sequences containing non-canonical amino acids.
    - gap: Non-canonical amino acids are replaced by the gap symbol (‘X’).
  - min_len (int, optional) – Minimum length of sequences for filtering.
  - max_len (int, optional) – Maximum length of sequences for filtering.
  - aa_window_size (int, default=9) – Length of amino acid window, only used for the amino acid dataset level (name='AA_'). Disabled if None. Must be odd, except for cleavage site datasets (e.g., ‘AA_CASPASE3’, ‘AA_FURIN’, ‘AA_MMP2’).
- Returns:
  df – A DataFrame of either the selected sequence dataset (df_seq) or an overview of all benchmark datasets (df_overview).
- Return type:
  pandas.DataFrame
Notes
Available datasets (pass as name), grouped by level:
- Amino acid level ('AA_*'): ‘AA_CASPASE3’, ‘AA_FURIN’, ‘AA_LDR’, ‘AA_MMP2’, ‘AA_RNABIND’, ‘AA_SA’.
- Sequence level ('SEQ_*'): ‘SEQ_AMYLO’, ‘SEQ_CAPSID’, ‘SEQ_DISULFIDE’, ‘SEQ_LOCATION’, ‘SEQ_SOLUBLE’, ‘SEQ_TAIL’.
- Domain level ('DOM_*'): ‘DOM_GSEC’, ‘DOM_GSEC_PU’.
See Protein Benchmark Datasets for the size, class balance, and reference predictor of each dataset.
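Because the level is encoded in the prefix of each dataset name, it can be recovered from the name alone. The following is a minimal sketch of that convention; the helper dataset_level and the mapping LEVEL_BY_PREFIX are hypothetical names written for illustration and are not part of aaanalysis.

```python
# Hypothetical mapping (not part of aaanalysis): name prefix -> benchmark level
LEVEL_BY_PREFIX = {"AA": "Amino acid", "SEQ": "Sequence", "DOM": "Domain"}


def dataset_level(name: str) -> str:
    """Return the benchmark level encoded in the prefix of a dataset name."""
    prefix = name.split("_", 1)[0]
    if prefix not in LEVEL_BY_PREFIX:
        raise ValueError(f"Unknown dataset prefix in {name!r}")
    return LEVEL_BY_PREFIX[prefix]


# Names taken from the list of available datasets above
names = ["AA_CASPASE3", "SEQ_AMYLO", "DOM_GSEC_PU"]
levels = {name: dataset_level(name) for name in names}
print(levels)
```

Names without a known prefix, such as 'Overview', raise a ValueError in this sketch, since they do not refer to a single dataset.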
df_seq includes these columns:
- ‘entry’: Protein identifier, either the UniProt accession number or an id based on index.
- ‘sequence’: Amino acid sequence.
- ‘label’: Binary classification label (0 for negatives, 1 for positives).
- ‘tmd_start’, ‘tmd_stop’: Start and stop positions of the transmembrane domain (TMD) (present only at the domain level).
- ‘jmd_n’, ‘tmd’, ‘jmd_c’: Sequences for JMD_N, TMD, and JMD_C, respectively.
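The domain-level columns are related: the three part sequences can be sliced from ‘sequence’ using ‘tmd_start’ and ‘tmd_stop’. The sketch below shows one way to do this. It assumes 1-based, inclusive positions and a flanking length of 10 residues, which is what the DOM_GSEC_PU rows in the Examples suggest (e.g., positions 701 to 723 with a 23-residue ‘tmd’ and 10-residue ‘jmd_n’ and ‘jmd_c’). The function split_domain and its jmd_len argument are hypothetical and do not describe how aaanalysis builds these columns.

```python
def split_domain(sequence, tmd_start, tmd_stop, jmd_len=10):
    """Slice (jmd_n, tmd, jmd_c) from a sequence.

    Assumption: tmd_start and tmd_stop are 1-based and inclusive.
    """
    if not 1 <= tmd_start <= tmd_stop <= len(sequence):
        raise ValueError("tmd_start/tmd_stop must lie within the sequence")
    start = tmd_start - 1  # convert to a 0-based index
    jmd_n = sequence[max(0, start - jmd_len):start]
    tmd = sequence[start:tmd_stop]
    jmd_c = sequence[tmd_stop:tmd_stop + jmd_len]
    return jmd_n, tmd, jmd_c


# Toy sequence (not a real protein): 12 x 'N', 8 x 'T', 12 x 'C'
toy = "N" * 12 + "T" * 8 + "C" * 12
jmd_n, tmd, jmd_c = split_domain(toy, tmd_start=13, tmd_stop=20)
print(jmd_n, tmd, jmd_c)
```

If the TMD lies closer than jmd_len residues to either terminus, the corresponding flank is simply shorter in this sketch.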
See also
Overview of all benchmarks in Protein Benchmark Datasets.
Step-by-step guide in the Data Loading Tutorial.
Examples
By default, an overview table of all datasets is returned. The prefix of each name in the ‘Dataset’ column (‘AA’, ‘SEQ’, or ‘DOM’) corresponds to the value in the ‘Level’ column (‘Amino acid’, ‘Sequence’, or ‘Domain’). Load datasets using the load_dataset() function:

    import aaanalysis as aa
    df_info = aa.load_dataset()
    aa.display_df(df=df_info, show_shape=True, max_height=600)

DataFrame shape: (14, 11)
|    | Level | Dataset | # Sequences | Avg length | # Amino acids | # Positives | # Negatives | Predictor | Description | Reference | Label |
|----|-------|---------|-------------|------------|---------------|-------------|-------------|-----------|-------------|-----------|-------|
| 1  | Amino acid | AA_CASPASE3 | 233 | 796.587983 | 185605 | 705 | 184900 | PROSPERous | Prediction of c...3 cleavage site | Song et al., 2018 | 1 (adjacent to ... cleavage site) |
| 2  | Amino acid | AA_FURIN | 71 | 831.028169 | 59003 | 163 | 58840 | PROSPERous | Prediction of f...n cleavage site | Song et al., 2018 | 1 (adjacent to ... cleavage site) |
| 3  | Amino acid | AA_LDR | 342 | 345.754386 | 118248 | 35469 | 82779 | IDP-Seq2Seq | Prediction of l...d regions (LDR) | Tang et al., 2020 | 1 (disordered), 0 (ordered) |
| 4  | Amino acid | AA_MMP2 | 573 | 546.205934 | 312976 | 2416 | 310560 | PROSPERous | Prediction of M...) cleavage site | Song et al., 2018 | 1 (adjacent to ... cleavage site) |
| 5  | Amino acid | AA_RNABIND | 221 | 248.873303 | 55001 | 6492 | 48509 | GMKSVM-RU | Prediction of R...(RBP60 dataset) | Yang et al., 2021 | 1 (binding), 0 (non-binding) |
| 6  | Amino acid | AA_SA | 233 | 796.587983 | 185605 | 101082 | 84523 | PROSPERous | Prediction of s...PASE3 data set) | Song et al., 2018 | 1 (exposed/acce...non-accessible) |
| 7  | Sequence | SEQ_AMYLO | 1414 | 6.000000 | 8484 | 511 | 903 | ReRF-Pred | Prediction of a...ognenic regions | Teng et al. 2021 | 1 (amyloidogeni...-amyloidogenic) |
| 8  | Sequence | SEQ_CAPSID | 7935 | 424.030246 | 3364680 | 3864 | 4071 | VIRALpro | Prediction of capdsid proteins | Galiez et al., 2016 | 1 (capsid prote...capsid protein) |
| 9  | Sequence | SEQ_DISULFIDE | 2547 | 241.252454 | 614470 | 897 | 1650 | Dipro | Prediction of d...es in sequences | Cheng et al., 2006 | 1 (sequence wit...ithout SS bond) |
| 10 | Sequence | SEQ_LOCATION | 1835 | 399.126975 | 732398 | 1045 | 790 | nan | Prediction of s...lasma membrane) | Shen et al., 2019 | 1 (protein in c...asma membrane) |
| 11 | Sequence | SEQ_SOLUBLE | 17408 | 254.611041 | 4432269 | 8704 | 8704 | SOLpro | Prediction of s...oluble proteins | Magnan et al., 2009 | 1 (soluble), 0 (insoluble) |
| 12 | Sequence | SEQ_TAIL | 6668 | 400.673365 | 2671690 | 2574 | 4094 | VIRALpro | Prediction of tail proteins | Galiez et al., 2016 | 1 (tail protein...n-tail protein) |
| 13 | Domain | DOM_GSEC | 126 | 737.809524 | 92964 | 63 | 63 | nan | Prediction of g...tase substrates | Breimann et al, 2024c | 1 (substrate), ...(non-substrate) |
| 14 | Domain | DOM_GSEC_PU | 694 | 712.570605 | 494524 | 63 | 0 | nan | Prediction of g...es (PU dataset) | Breimann et al, 2024c | 1 (substrate), ...bstrate status) |

Load one of the datasets from the overview table by using a name from the ‘Dataset’ column (e.g., name='SEQ_CAPSID'). The number of proteins per class can be adjusted by the n parameter:

    df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2)
    aa.display_df(df=df_seq)

|   | entry | sequence | label |
|---|-------|----------|-------|
| 1 | CAPSID_1 | MVTHNVKINKHVTRR...DTPRIPATKLDEENV | 0 |
| 2 | CAPSID_2 | MKKRQKKMTLSNFTD...AMLEAVINARHFGEE | 0 |
| 3 | CAPSID_4072 | MALTTNDVITEDFVR...AWKAIFPEAAVKVDA | 1 |
| 4 | CAPSID_4073 | MGELTDNGVQLAKAQ...TCTNPAAHAKIRDLK | 1 |

The sampling can be performed randomly by setting random=True:

    df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2, random=True)
    aa.display_df(df=df_seq)

|   | entry | sequence | label |
|---|-------|----------|-------|
| 1 | CAPSID_1080 | MKFTQFGEKFTRYSG...GINIIAEEVLKAYSE | 0 |
| 2 | CAPSID_3847 | MLAELLSTFRRRPPE...KRAATGYGEGGRRRG | 0 |
| 3 | CAPSID_6263 | MANYQDIAVEFAGDL...QNVVAAARVFRGTGV | 1 |
| 4 | CAPSID_5423 | MGALLAVIAEVAEVS...TTPHRSSKTYSKRRH | 1 |

Sequences with non-canonical amino acids are by default removed, which can be disabled by setting non_canonical_aa='keep' or non_canonical_aa='gap':

    n_unfiltered = len(aa.load_dataset(name='SEQ_DISULFIDE', non_canonical_aa="keep"))
    n = len(aa.load_dataset(name='SEQ_DISULFIDE'))
    print(f"'SEQ_DISULFIDE' contain {n_unfiltered} proteins and {n} after filtering.")

    'SEQ_DISULFIDE' contain 2547 proteins and 2202 after filtering.

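To make the three non_canonical_aa options concrete, the sketch below applies them to a small list of toy sequences. It illustrates the documented behavior and is not the library's implementation; the function handle_non_canonical, the CANONICAL_AA set (the 20 standard one-letter codes), and the toy sequences are assumptions introduced here.

```python
# Assumption: "canonical" means the 20 standard one-letter amino acid codes
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def handle_non_canonical(sequences, mode="remove", gap="X"):
    """Apply one of the three documented options to a list of sequences."""
    if mode == "keep":
        # Leave all sequences untouched
        return list(sequences)
    if mode == "remove":
        # Drop every sequence that contains at least one non-canonical residue
        return [s for s in sequences if set(s) <= CANONICAL_AA]
    if mode == "gap":
        # Keep all sequences, replacing non-canonical residues by the gap symbol
        return ["".join(aa if aa in CANONICAL_AA else gap for aa in s)
                for s in sequences]
    raise ValueError("mode must be 'remove', 'keep', or 'gap'")


# Toy sequences: 'U' and 'B' are not among the 20 standard codes
toy_seqs = ["MKTAY", "MKUAY", "ACBDE"]
print(handle_non_canonical(toy_seqs, mode="remove"))
print(handle_non_canonical(toy_seqs, mode="gap"))
```

Only 'remove' changes the number of sequences, which is why the protein count in the example above drops after filtering.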
Datasets can be filtered for the minimum and maximum sequence length using min_len and max_len:

    n_len_filtered = len(aa.load_dataset(name='SEQ_DISULFIDE', min_len=100, max_len=200))
    print(f"'SEQ_DISULFIDE' contain {n_unfiltered} proteins, of which {n_len_filtered} have a length between 100 and 200 residues.")

    'SEQ_DISULFIDE' contain 2547 proteins, of which 644 have a length between 100 and 200 residues.

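The length filter can be pictured as follows. This is a minimal sketch with a hypothetical function, filter_by_length, and it assumes that both bounds are inclusive and that None disables a bound; the documentation above does not state how the bounds are treated.

```python
def filter_by_length(sequences, min_len=None, max_len=None):
    """Keep sequences whose length lies within [min_len, max_len].

    Assumption: both bounds are inclusive; None disables a bound.
    """
    kept = []
    for seq in sequences:
        if min_len is not None and len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append(seq)
    return kept


# Toy sequences of lengths 50, 100, 150, 200, and 250
toy_seqs = ["A" * n for n in (50, 100, 150, 200, 250)]
kept = filter_by_length(toy_seqs, min_len=100, max_len=200)
print([len(s) for s in kept])
```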
For the ‘Amino acid level’ datasets, the size of the amino acid window can be adjusted using the aa_window_size parameter:

    df_aa = aa.load_dataset(name="AA_CASPASE3", n=2, aa_window_size=5)
    aa.display_df(df=df_aa)

|   | entry | sequence | label |
|---|-------|----------|-------|
| 1 | CASPASE3_1_pos126 | LRDSM | 1 |
| 2 | CASPASE3_1_pos127 | RDSML | 1 |
| 3 | CASPASE3_1_pos2 | MSLFD | 0 |
| 4 | CASPASE3_1_pos3 | SLFDL | 0 |

For Positive-Unlabeled (PU) learning, datasets are provided that contain only positive (labeled ‘1’) and unlabeled (labeled ‘2’) data, indicated by a ‘PU’ suffix in the dataset name:

    df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=3)
    aa.display_df(df=df_seq)

|   | entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c |
|---|-------|----------|-------|-----------|----------|-------|-----|-------|
| 1 | P05067 | MLPGLALLLLAAWTA...GYENPTYKFFEQMQN | 1 | 701 | 723 | FAEDVGSNKG | AIIGLMVGGVVIATVIVITLVML | KKKQYTSIHH |
| 2 | P14925 | MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS | 1 | 868 | 890 | KLSTEPGSGV | SVVLITTLLVIPVLVLLAIVMFI | RWKKSRAFGD |
| 3 | P70180 | MRSLLLFTFSACVLL...RELREDSIRSHFSVA | 1 | 477 | 499 | PCKSSGGLEE | SAVTGIVVGALLGAGLLMAFYFF | RKKYRITIER |
| 4 | P12821 | MGAASGRRGPGLLLP...SHGPQFGSEVELRHS | 2 | 1257 | 1276 | GLDLDAQQAR | VGQWLLLFLGIALLVATLGL | SQRLFSIRHR |
| 5 | P36896 | MAESAGASSFFPLVV...KKTLSQLSVQEDVKI | 2 | 127 | 149 | EHPSMWGPVE | LVGIIAGPVFLLFLIIIIVFLVI | NYHQRVYHNR |
| 6 | Q8NER5 | MTRALCSALRQALLL...KKTISQLCVKEDCKA | 2 | 114 | 136 | PNAPKLGPME | LAIIITVPVCLLSIAAMLTVWAC | QGRQCSYRKK |
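
Before PU learning, the rows of such a dataset are typically separated by label. The sketch below shows one way to do this on plain (entry, label) pairs copied from the table above. The function split_pu is a hypothetical helper, not part of aaanalysis, and it treats any label other than 1 or 2 as an error.

```python
def split_pu(rows, positive_label=1, unlabeled_label=2):
    """Split (entry, label) pairs into positive and unlabeled entries."""
    unexpected = [entry for entry, label in rows
                  if label not in (positive_label, unlabeled_label)]
    if unexpected:
        raise ValueError(f"Entries with unexpected labels: {unexpected}")
    positives = [entry for entry, label in rows if label == positive_label]
    unlabeled = [entry for entry, label in rows if label == unlabeled_label]
    return positives, unlabeled


# (entry, label) pairs from the DOM_GSEC_PU example output
rows = [
    ("P05067", 1), ("P14925", 1), ("P70180", 1),
    ("P12821", 2), ("P36896", 2), ("Q8NER5", 2),
]
positives, unlabeled = split_pu(rows)
print(f"{len(positives)} positives, {len(unlabeled)} unlabeled")
```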