Data loading

This tutorial shows how to load protein benchmark datasets.

Loading of protein benchmarks

Load the overview table of protein benchmark datasets using the default settings:

import aaanalysis as aa
df_info = aa.load_dataset()
aa.display_df(df=df_info)
  | Level | Dataset | # Sequences | Avg length | # Amino acids | # Positives | # Negatives | Predictor | Description | Reference | Label
1 | Amino acid | AA_CASPASE3 | 233 | 796.587983 | 185605 | 705 | 184900 | PROSPERous | Prediction of c...3 cleavage site | Song et al., 2018 | 1 (adjacent to ... cleavage site)
2 | Amino acid | AA_FURIN | 71 | 831.028169 | 59003 | 163 | 58840 | PROSPERous | Prediction of f...n cleavage site | Song et al., 2018 | 1 (adjacent to ... cleavage site)
3 | Amino acid | AA_LDR | 342 | 345.754386 | 118248 | 35469 | 82779 | IDP-Seq2Seq | Prediction of l...d regions (LDR) | Tang et al., 2020 | 1 (disordered), 0 (ordered)
4 | Amino acid | AA_MMP2 | 573 | 546.205934 | 312976 | 2416 | 310560 | PROSPERous | Prediction of M...) cleavage site | Song et al., 2018 | 1 (adjacent to ... cleavage site)
5 | Amino acid | AA_RNABIND | 221 | 248.873303 | 55001 | 6492 | 48509 | GMKSVM-RU | Prediction of R...(RBP60 dataset) | Yang et al., 2021 | 1 (binding), 0 (non-binding)
6 | Amino acid | AA_SA | 233 | 796.587983 | 185605 | 101082 | 84523 | PROSPERous | Prediction of s...PASE3 data set) | Song et al., 2018 | 1 (exposed/acce...non-accessible)
7 | Sequence | SEQ_AMYLO | 1414 | 6.000000 | 8484 | 511 | 903 | ReRF-Pred | Prediction of a...ognenic regions | Teng et al. 2021 | 1 (amyloidogeni...-amyloidogenic)
8 | Sequence | SEQ_CAPSID | 7935 | 424.030246 | 3364680 | 3864 | 4071 | VIRALpro | Prediction of capdsid proteins | Galiez et al., 2016 | 1 (capsid prote...capsid protein)
9 | Sequence | SEQ_DISULFIDE | 2547 | 241.252454 | 614470 | 897 | 1650 | Dipro | Prediction of d...es in sequences | Cheng et al., 2006 | 1 (sequence wit...ithout SS bond)
10 | Sequence | SEQ_LOCATION | 1835 | 399.126975 | 732398 | 1045 | 790 | nan | Prediction of s...lasma membrane) | Shen et al., 2019 | 1 (protein in c...asma membrane)
11 | Sequence | SEQ_SOLUBLE | 17408 | 254.611041 | 4432269 | 8704 | 8704 | SOLpro | Prediction of s...oluble proteins | Magnan et al., 2009 | 1 (soluble), 0 (insoluble)
12 | Sequence | SEQ_TAIL | 6668 | 400.673365 | 2671690 | 2574 | 4094 | VIRALpro | Prediction of tail proteins | Galiez et al., 2016 | 1 (tail protein...n-tail protein)
13 | Domain | DOM_GSEC | 126 | 737.809524 | 92964 | 63 | 63 | nan | Prediction of g...tase substrates | Breimann et al, 2024c | 1 (substrate), ...(non-substrate)
14 | Domain | DOM_GSEC_PU | 694 | 712.570605 | 494524 | 63 | 0 | nan | Prediction of g...es (PU dataset) | Breimann et al, 2024c | 1 (substrate), ...bstrate status)

The benchmark datasets are grouped by level: amino acid (‘AA’), sequence (‘SEQ’), and domain (‘DOM’). The level is indicated by the name prefix, as the following example shows:

# Load one dataset per level (amino acid, sequence, domain) to compare their columns
df_seq1 = aa.load_dataset(name="AA_CASPASE3")
df_seq2 = aa.load_dataset(name="SEQ_CAPSID")
df_seq3 = aa.load_dataset(name="DOM_GSEC")
# Display the domain-level dataset
aa.display_df(df=df_seq3, n_rows=8, char_limit=25)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 P05067 MLPGLALLLLAA...NPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
2 P14925 MAGRARSGLLLL...YSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
3 P70180 MRSLLLFTFSAC...REDSIRSHFSVA 1 477 499 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER
4 Q03157 MGPTSPAARGQG...ENPTYRFLEERP 1 585 607 APSGTGVSRE ALSGLLIMGAGGGSLIVLSLLLL RKKKPYGTIS
5 Q06481 MAATGTAAAAAT...NPTYKYLEQMQI 1 694 716 LREDFSLSSS ALIGLLVIAVAIATVIVISLVML RKRQYGTISH
6 P35613 MAAALFVLLGFA...DKGKNVRQRNSS 1 323 345 IITLRVRSHL AALWPFLGIVAEVLVLVTIIFIY EKRRKPEDVL
7 P35070 MDRAARCSGASS...PINEDIEETNIA 1 119 141 LFYLRGDRGQ ILVICLIAVMVVFIILVIGVCTC CHPLRKRRKR
8 P09803 MGARCRSFSALL...KLADMYGGGEDD 1 711 733 GIVAAGLQVP AILGILGGILALLILILLLLLFL RRRTVVKEPL
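
The prefix convention described above can be sketched in plain Python. The helper `level_from_name` and the mapping `PREFIX_TO_LEVEL` are hypothetical names introduced for illustration only; they are not part of AAanalysis.

```python
# Hypothetical helper (not part of AAanalysis): infer the dataset level
# from the name prefix listed in the overview table.
PREFIX_TO_LEVEL = {"AA": "Amino acid", "SEQ": "Sequence", "DOM": "Domain"}


def level_from_name(name: str) -> str:
    """Return the level encoded in the prefix of a dataset name."""
    prefix = name.split("_", 1)[0]
    if prefix not in PREFIX_TO_LEVEL:
        raise ValueError(f"Unknown dataset prefix: {prefix!r}")
    return PREFIX_TO_LEVEL[prefix]


names = ["AA_CASPASE3", "SEQ_CAPSID", "DOM_GSEC_PU"]
levels = {name: level_from_name(name) for name in names}
print(levels)
```

A lookup like this can be handy for grouping dataset names by level before loading them.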

Each dataset can be used for binary classification, with labels being positive (1) or negative (0). The n parameter selects a balanced subset by setting the number of samples per class:

# Returns 200 samples, the first 100 positives and the first 100 negatives
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100)
aa.display_df(df=df_seq, n_rows=8, char_limit=25)
  entry sequence label
1 CAPSID_1 MVTHNVKINKHV...RIPATKLDEENV 0
2 CAPSID_2 MKKRQKKMTLSN...EAVINARHFGEE 0
3 CAPSID_3 MRYGGSVISQEL...GAPDKQEVELVD 0
4 CAPSID_4 MERGDIPFKYVG...LEELAEMDAGLI 0
5 CAPSID_5 MKRIYLLFAALI...CVGEPITELTHQ 0
6 CAPSID_6 MARARESQAEAL...RNTISSRISRTW 0
7 CAPSID_8 MLFKQLFTIISS...VDIIKALIIIKG 0
8 CAPSID_9 MKILIVEDEPKT...GMGYVLEAPQPR 0

Alternatively, samples can be selected randomly by setting random=True:

# Returns random samples, 100 positives and 100 negatives
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100, random=True)
aa.display_df(df=df_seq, n_rows=8, char_limit=25)
  entry sequence label
1 CAPSID_2351 MESIRGAIWEDR...LHPIEEGNTIHK 0
2 CAPSID_149 MEMVSGSGSYTA...LRRRQQQDNGTV 0
3 CAPSID_1161 MFNIDEVYTISS...ALYSNIEKIEKN 0
4 CAPSID_3899 MAKSLTQILGQA...DWLNDQPRGKSE 0
5 CAPSID_1424 MTPQATANSLFG...KPPEEEGNGKLE 0
6 CAPSID_762 MIAGNCRMCLVE...AAGEMSPQALYG 0
7 CAPSID_3590 MVFDLLRRKTGR...GLPPRTAEVADD 0
8 CAPSID_3467 MAEPEYLSTRQA...LLGRIAARQRAS 0
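
To make the effect of n and random concrete, here is a minimal sketch of balanced per-class selection on toy records. It is an illustration under stated assumptions, not the AAanalysis implementation: the function `sample_per_class`, its `seed` argument, the toy entries, and the ordering of classes in the result are all hypothetical.

```python
import random


def sample_per_class(records, n, use_random=False, seed=0):
    """Return n records per label: the first n, or n drawn at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    selected = []
    # Class order in the result is an arbitrary choice of this sketch
    for label in sorted({r["label"] for r in records}):
        group = [r for r in records if r["label"] == label]
        if n > len(group):
            raise ValueError(f"Class {label} has only {len(group)} samples")
        selected.extend(rng.sample(group, n) if use_random else group[:n])
    return selected


# Toy data (hypothetical entries): odd i is positive (1), even i is negative (0)
records = [{"entry": f"TOY_{i}", "sequence": "M" + "A" * i, "label": i % 2}
           for i in range(1, 11)]

first = sample_per_class(records, n=2)
rand = sample_per_class(records, n=2, use_random=True)
print([r["entry"] for r in first])  # ['TOY_2', 'TOY_4', 'TOY_1', 'TOY_3']
print(sorted(r["label"] for r in rand))  # [0, 0, 1, 1]
```

Either way the result contains 2 * n samples, which is why n=100 yields 200 rows in the examples above.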

The protein sequences vary in length, as the following histogram shows:

# Plot distribution
import matplotlib.pyplot as plt
import seaborn as sns

# Utility AAanalysis function for publication ready plots
aa.plot_settings(font_scale=1.2)
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100)
list_seq_lens = df_seq["sequence"].apply(len)
sns.histplot(list_seq_lens, binwidth=50)
sns.despine()
plt.xlim(0, 1500)
plt.show()
[Figure: histogram of sequence lengths for SEQ_CAPSID]

Sequences can be filtered by length using the min_len and max_len parameters:

df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100, min_len=200, max_len=800)
list_seq_lens = df_seq["sequence"].apply(len)
aa.plot_settings(font_scale=1.2)  # Utility AAanalysis function for publication ready plots
sns.histplot(list_seq_lens, binwidth=50)
sns.despine()
plt.xlim(0, 1500)
plt.show()
[Figure: histogram of sequence lengths for SEQ_CAPSID after filtering with min_len=200 and max_len=800]
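
The length filter can be pictured as a simple bounds check on each sequence. The sketch below uses a hypothetical function `filter_by_length` and toy sequences; treating both bounds as inclusive is an assumption of this sketch, not a documented guarantee of load_dataset.

```python
def filter_by_length(sequences, min_len=None, max_len=None):
    """Keep sequences whose length lies within [min_len, max_len].

    Both bounds are inclusive here, which is an assumption of this sketch.
    """
    kept = []
    for seq in sequences:
        if min_len is not None and len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append(seq)
    return kept


# Toy sequences (hypothetical) of lengths 150, 200, 500, 800, and 1200
toy_seqs = ["A" * length for length in (150, 200, 500, 800, 1200)]
kept_lens = [len(s) for s in filter_by_length(toy_seqs, min_len=200, max_len=800)]
print(kept_lens)  # [200, 500, 800]
```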

Loading of protein benchmarks: Amino acid window size

For amino acid level datasets, labels are provided for each residue position, which can be seen by setting aa_window_size=None:

df_seq = aa.load_dataset(name="AA_CASPASE3", aa_window_size=None)
aa.display_df(df=df_seq.head(10), char_limit=25)
  entry sequence label
1 CASPASE3_1 MSLFDLFRGFFG...LDLFLGRWFRSR 0,0,0,0,0,0,...,0,0,0,0,0,0
2 CASPASE3_2 MEVTGDAGVPES...LQNPKRARQDPT 0,0,0,0,0,0,...,0,0,0,0,0,0
3 CASPASE3_3 MRARSGARGALL...EMLVAMTTDGDC 0,0,0,0,0,0,...,0,0,0,0,0,0
4 CASPASE3_4 MDAKARNCLLQH...NLGILYILQTLE 0,0,0,0,0,0,...,0,0,0,0,0,0
5 CASPASE3_5 MTSFSTSAQCST...KEIQLVIKVFIA 0,0,0,0,0,0,...,0,0,0,0,0,0
6 CASPASE3_6 MGLGASSEQPAG...PDPEPGLCEGPW 0,0,0,0,0,0,...,0,0,0,0,0,0
7 CASPASE3_7 MANQVNGNAVQL...EFYQDTYGQQWK 0,0,0,0,0,0,...,0,0,0,0,0,0
8 CASPASE3_8 MAKQPSDVSSEC...LRYIVRLVWRMH 0,0,0,0,0,0,...,0,0,0,0,0,0
9 CASPASE3_9 MCTALSPKVRSG...VSASYKAKKEIK 0,0,0,0,0,0,...,0,0,0,0,0,0
10 CASPASE3_10 MFYAHFVLSKRG...IIATPGPRFHII 0,0,0,0,0,0,...,0,0,0,0,0,0

For convenience, we provide an “amino acid window” whose length is set by aa_window_size. The window represents a specific amino acid, flanked by (aa_window_size-1)/2 residues on both its N-terminal and C-terminal sides. The window size must be odd so that both flanks have the same length. The default window size is 9; sizes between 5 and 15 are also common. Note that the window size is distinct from the n parameter, which sets the number of samples per class:

df_seq = aa.load_dataset(name="AA_CASPASE3", n=2)
aa.display_df(df=df_seq, char_limit=25)
  entry sequence label
1 CASPASE3_1_pos126 QTLRDSMLK 1
2 CASPASE3_1_pos127 TLRDSMLKY 1
3 CASPASE3_1_pos4 MSLFDLFRG 0
4 CASPASE3_1_pos5 SLFDLFRGF 0
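
The window construction can be sketched as a slice around a central residue. The function `get_window` is hypothetical and not part of AAanalysis. With a 0-based center index, it reproduces the two negative windows shown above from the displayed start of CASPASE3_1; whether the library indexes positions this way internally is an assumption.

```python
def get_window(sequence, center, window_size=9):
    """Return the window of odd length window_size centered on index center."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    flank = (window_size - 1) // 2  # residues on each side of the center
    start, stop = center - flank, center + flank + 1
    if start < 0 or stop > len(sequence):
        raise ValueError("window does not fit within the sequence")
    return sequence[start:stop]


# First 12 residues of CASPASE3_1 as displayed in the table above
seq_start = "MSLFDLFRGFFG"
win4 = get_window(seq_start, center=4)
win5 = get_window(seq_start, center=5)
print(win4)  # MSLFDLFRG
print(win5)  # SLFDLFRGF
```

This sketch rejects windows that would run past either end of the sequence; padding the termini would be an alternative design.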

Sequences can be pre-filtered using min_len and max_len, and n residues per class can be selected randomly by setting random=True, optionally with a different aa_window_size:

df_seq = aa.load_dataset(name="AA_CASPASE3", min_len=20, n=2, random=True, aa_window_size=21)
aa.display_df(df=df_seq, char_limit=25)
  entry sequence label
1 CASPASE3_17_pos29 HWPARVDEVPDGAVKPPTNKL 1
2 CASPASE3_36_pos96 TGKRAAEDDEDDDVDTKKQKT 1
3 CASPASE3_4_pos133 VVFVTRKKLVNAIQQKLSKLK 0
4 CASPASE3_199_pos398 KQNGPKTPVHSSGDMVQPLSP 0

Loading of protein benchmarks: Positive-Unlabeled (PU) datasets

In typical binary classification, data is labeled as positive (1) or negative (0). Many protein sequence datasets, however, are small, unbalanced, or lack a clear negative class. For datasets that contain only positive (1) and unlabeled (2) samples, we use PU learning, which identifies reliable negatives among the unlabeled data to make binary classification possible. We offer benchmark datasets for this scenario, denoted by the _PU suffix. For example, the DOM_GSEC_PU dataset corresponds to the DOM_GSEC dataset:

df_seq = aa.load_dataset(name="DOM_GSEC", n=2)
aa.display_df(df=df_seq, char_limit=25)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 Q14802 MQKVTLGLLVFL...TPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
2 Q86UE4 MAARSWQDELAQ...QIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR
3 P05067 MLPGLALLLLAA...NPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
4 P14925 MAGRARSGLLLL...YSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD

# Load the corresponding PU dataset; label 2 marks unlabeled samples
df_seq_pu = aa.load_dataset(name="DOM_GSEC_PU", n=2)
aa.display_df(df=df_seq_pu, char_limit=25)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 P05067 MLPGLALLLLAA...NPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
2 P14925 MAGRARSGLLLL...YSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
3 P12821 MGAASGRRGPGL...PQFGSEVELRHS 2 1257 1276 GLDLDAQQAR VGQWLLLFLGIALLVATLGL SQRLFSIRHR
4 P36896 MAESAGASSFFP...LSQLSVQEDVKI 2 127 149 EHPSMWGPVE LVGIIAGPVFLLFLIIIIVFLVI NYHQRVYHNR
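
As a rough illustration of what "identifying reliable negatives" can mean, the sketch below ranks unlabeled samples by their distance from the centroid of the positives and relabels the most distant ones as negatives. This is a deliberately simplified distance heuristic on made-up 2D feature vectors, not the PU learning method used by AAanalysis; the function `reliable_negatives` and all values are hypothetical.

```python
import math

POSITIVE, NEGATIVE, UNLABELED = 1, 0, 2


def reliable_negatives(features, labels, n_neg):
    """Return indices of the n_neg unlabeled samples farthest from the positive centroid."""
    positives = [f for f, lab in zip(features, labels) if lab == POSITIVE]
    centroid = [sum(col) / len(positives) for col in zip(*positives)]
    unlabeled_idx = [i for i, lab in enumerate(labels) if lab == UNLABELED]
    ranked = sorted(unlabeled_idx,
                    key=lambda i: math.dist(features[i], centroid),
                    reverse=True)
    return sorted(ranked[:n_neg])


# Toy feature vectors (hypothetical): two positives, four unlabeled samples
features = [(1.0, 1.0), (1.2, 0.8), (1.1, 1.1), (5.0, 5.0), (0.9, 1.0), (4.0, 6.0)]
labels = [1, 1, 2, 2, 2, 2]

neg_idx = reliable_negatives(features, labels, n_neg=2)
new_labels = [NEGATIVE if i in neg_idx else lab for i, lab in enumerate(labels)]
print(neg_idx)     # [3, 5]
print(new_labels)  # [1, 1, 2, 0, 2, 0]
```

After relabeling, the positives and the selected negatives form a balanced binary dataset, while the remaining samples stay unlabeled (2).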

See our Data Handling API and the Tables section for more details.