load_dataset

class load_dataset(name='Overview', n=None, random=False, non_canonical_aa='remove', min_len=None, max_len=None, aa_window_size=9, verbose=False)

Load protein benchmarking datasets.

The benchmarks are grouped by the package’s three prediction levels, encoded in the dataset name prefix: residue level ('AA_*'), domain level ('DOM_*'), and protein / sequence level ('SEQ_*'). These are the same three levels scored by predict() via its level argument — AA_ ↔ level='window' (residues are represented as windows), DOM_ ↔ level='domain', and SEQ_ ↔ level='sequence' (a whole sequence, typically a protein). By default, an overview table is provided (name='Overview'). For in-depth details, refer to [Breimann24a].

Added in version 0.1.0.

Changed in version 1.1.0: Added the verbose parameter, which reports how many entries each removal step (min_len, max_len, and non_canonical_aa='remove') dropped. The returned data is unchanged. Every dataset now carries a human-readable gene column immediately after entry (the UniProt gene symbol for the domain datasets, a positional name_<row> placeholder for the amino-acid / sequence datasets); all other columns are unchanged.

Parameters:

name (str, default='Overview') – The dataset to load, given as its ‘Dataset’ name (see Notes for the full list grouped by level). The default 'Overview' returns the benchmark overview table instead of a single dataset.
n (int, optional) – Number of proteins per class, selected by index. If None, the whole dataset will be returned.
random (bool, default=False) – If True, the n proteins per class are selected randomly instead of by index.
non_canonical_aa ({'remove', 'keep', 'gap'}, default='remove') –
Options for handling non-canonical amino acids:
- remove: Remove sequences containing non-canonical amino acids (the count removed is reported when verbose=True). To retain every entry (e.g. for inspection), use keep.
- keep: Don’t remove sequences containing non-canonical amino acids.
- gap: Non-canonical amino acids are replaced by the gap symbol ('-').
min_len (int, optional) – Minimum length of sequences for filtering. The number of entries removed is reported when verbose=True.
max_len (int, optional) – Maximum length of sequences for filtering. The number of entries removed is reported when verbose=True.
aa_window_size (int, default=9) – Length of the amino acid window; only used for residue-level datasets (names starting with 'AA_'). Windowing is disabled if None. Must be odd, except for cleavage site datasets (e.g., ‘AA_CASPASE3’, ‘AA_FURIN’, ‘AA_MMP2’).
verbose (bool, default=False) – If True, report how many entries each removal step (min_len, max_len, and non_canonical_aa='remove') dropped. Does not change the returned data.

Returns:

df_seq (pd.DataFrame) – When name is not 'Overview': the selected sequence dataset with columns entry, gene, sequence, label (plus tmd_start, tmd_stop, jmd_n, tmd, jmd_c for domain-level datasets). Windowed amino-acid datasets do not carry the gene column (see Notes).
df (pd.DataFrame) – When name='Overview': a summary table of all available benchmarks (no sequence column; one row per dataset), including an 'Avg length' column (see Notes for its definition).

Notes

Available datasets (pass as name), grouped by level:

Residue level ('AA_*', amino-acid windows): ‘AA_CASPASE3’, ‘AA_FURIN’, ‘AA_LDR’, ‘AA_MMP2’, ‘AA_RNABIND’, ‘AA_SA’.
Protein / sequence level ('SEQ_*'): ‘SEQ_AMYLO’, ‘SEQ_CAPSID’, ‘SEQ_DISULFIDE’, ‘SEQ_LOCATION’, ‘SEQ_SOLUBLE’, ‘SEQ_TAIL’.
Domain level ('DOM_*'): ‘DOM_GSEC’, ‘DOM_GSEC_PU’.

See Protein Benchmark Datasets for the size, class balance, and reference predictor of each dataset.
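
The prefix convention above can be sketched as a small lookup. This is an illustrative helper only: `PREFIX_TO_LEVEL` and `level_from_dataset_name` are hypothetical names that are not part of aaanalysis, and the mapping simply restates the AA_ / DOM_ / SEQ_ correspondence to the `level` argument of predict() described on this page.

```python
# Hypothetical helper (not part of aaanalysis): map a dataset name to the
# predict() level implied by its prefix, as described in the text above.
PREFIX_TO_LEVEL = {"AA": "window", "DOM": "domain", "SEQ": "sequence"}


def level_from_dataset_name(name: str) -> str:
    """Return the prediction level implied by the dataset name prefix."""
    prefix = name.split("_", 1)[0]
    if prefix not in PREFIX_TO_LEVEL:
        raise ValueError(f"Unknown dataset prefix in {name!r}")
    return PREFIX_TO_LEVEL[prefix]


for name in ["AA_CASPASE3", "DOM_GSEC_PU", "SEQ_CAPSID"]:
    print(name, "->", level_from_dataset_name(name))
```

Names without one of the three prefixes, such as 'Overview', raise a ValueError in this sketch.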

The Overview table’s 'Avg length' column is the mean number of residues per sequence, averaged over the complete dataset (all sequences, before any non-canonical-amino-acid removal). For amino acid level ('AA_*') datasets this is the mean full-protein length, not the length of the windowed sequences returned under the default aa_window_size.
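
As a quick sanity check of this definition, the 'Avg length' values agree with dividing '# Amino acids' by '# Sequences'. The sketch below uses numbers copied by hand from three rows of the Overview table shown in the Examples; it does not call the library, and the dictionary name `overview_rows` is only illustrative.

```python
# Values copied from the Overview table: (# Sequences, # Amino acids).
overview_rows = {
    "AA_CASPASE3": (233, 185605),
    "SEQ_AMYLO": (1414, 8484),
    "DOM_GSEC": (126, 92964),
}

# Mean number of residues per sequence over the complete dataset.
avg_length = {
    name: round(n_aa / n_seq, 6) for name, (n_seq, n_aa) in overview_rows.items()
}

for name, value in avg_length.items():
    print(name, value)
```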

df_seq includes these columns:

‘entry’: Protein identifier, either the UniProt accession number or an id based on index.
‘gene’: Human-readable gene symbol (UniProt gene name for the domain datasets; a positional name_<row> placeholder for the amino-acid / sequence datasets, whose entries are synthetic). Lets a sample selector be resolved by gene symbol (see SequenceFeature.get_seq_kws()). Present for the domain / sequence datasets and the raw amino-acid tables; amino-acid windowing (aa_window_size) rebuilds per-position entries and does not carry it.
‘sequence’: Amino acid sequence.
‘label’: Binary classification label (0 for negatives, 1 for positives). In the PU datasets, unlabeled entries are marked with 2 instead.
‘tmd_start’, ‘tmd_stop’: Start and stop positions of target middle domain (TMD) (present only at the domain level).
‘jmd_n’, ‘tmd’, ‘jmd_c’: Sequences for the N-terminal juxta middle domain (JMD), the TMD, and the C-terminal JMD, respectively (present only at the domain level).

See also

Overview of all benchmarks in Protein Benchmark Datasets.
Step-by-step guide in the Data Loading Tutorial.
predict() — scores raw sequences at these same three prediction levels (AA_ → level='window', DOM_ → level='domain', SEQ_ → level='sequence').

Examples

An overview table of all datasets is returned by default. The prefix of each name in the ‘Dataset’ column (‘AA’, ‘SEQ’, or ‘DOM’) corresponds to the value in the ‘Level’ column (‘Amino acid’, ‘Sequence’, or ‘Domain’). Load datasets using the load_dataset() function:

import aaanalysis as aa
df_info = aa.load_dataset()
aa.display_df(df=df_info, show_shape=True, max_height=600)

DataFrame shape: (14, 11)

	Level	Dataset	# Sequences	Avg length	# Amino acids	# Positives	# Negatives	Predictor	Description	Reference	Label
1	Amino acid	AA_CASPASE3	233	796.587983	185605	705	184900	PROSPERous	Prediction of c...3 cleavage site	Song et al., 2018	1 (adjacent to ... cleavage site)
2	Amino acid	AA_FURIN	71	831.028169	59003	163	58840	PROSPERous	Prediction of f...n cleavage site	Song et al., 2018	1 (adjacent to ... cleavage site)
3	Amino acid	AA_LDR	342	345.754386	118248	35469	82779	IDP-Seq2Seq	Prediction of l...d regions (LDR)	Tang et al., 2020	1 (disordered), 0 (ordered)
4	Amino acid	AA_MMP2	573	546.205934	312976	2416	310560	PROSPERous	Prediction of M...) cleavage site	Song et al., 2018	1 (adjacent to ... cleavage site)
5	Amino acid	AA_RNABIND	221	248.873303	55001	6492	48509	GMKSVM-RU	Prediction of R...(RBP60 dataset)	Yang et al., 2021	1 (binding), 0 (non-binding)
6	Amino acid	AA_SA	233	796.587983	185605	101082	84523	PROSPERous	Prediction of s...PASE3 data set)	Song et al., 2018	1 (exposed/acce...non-accessible)
7	Sequence	SEQ_AMYLO	1414	6.000000	8484	511	903	ReRF-Pred	Prediction of a...ognenic regions	Teng et al. 2021	1 (amyloidogeni...-amyloidogenic)
8	Sequence	SEQ_CAPSID	7935	424.030246	3364680	3864	4071	VIRALpro	Prediction of capdsid proteins	Galiez et al., 2016	1 (capsid prote...capsid protein)
9	Sequence	SEQ_DISULFIDE	2547	241.252454	614470	897	1650	Dipro	Prediction of d...es in sequences	Cheng et al., 2006	1 (sequence wit...ithout SS bond)
10	Sequence	SEQ_LOCATION	1835	399.126975	732398	1045	790	nan	Prediction of s...lasma membrane)	Shen et al., 2019	1 (protein in c...asma membrane)
11	Sequence	SEQ_SOLUBLE	17408	254.611041	4432269	8704	8704	SOLpro	Prediction of s...oluble proteins	Magnan et al., 2009	1 (soluble), 0 (insoluble)
12	Sequence	SEQ_TAIL	6668	400.673365	2671690	2574	4094	VIRALpro	Prediction of tail proteins	Galiez et al., 2016	1 (tail protein...n-tail protein)
13	Domain	DOM_GSEC	126	737.809524	92964	63	63	nan	Prediction of g...tase substrates	Breimann et al, 2024c	1 (substrate), ...(non-substrate)
14	Domain	DOM_GSEC_PU	694	712.570605	494524	63	0	nan	Prediction of g...es (PU dataset)	Breimann et al, 2024c	1 (substrate), ...bstrate status)

Load one of the datasets from the overview table by using a name from the ‘Dataset’ column (e.g., name='SEQ_CAPSID'). The number of proteins per class can be adjusted by the n parameter:

df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2)
aa.display_df(df=df_seq)

	entry	gene	sequence	label
1	CAPSID_1	name_1	MVTHNVKINKHVTRR...DTPRIPATKLDEENV	0
2	CAPSID_2	name_2	MKKRQKKMTLSNFTD...AMLEAVINARHFGEE	0
3	CAPSID_4072	name_4072	MALTTNDVITEDFVR...AWKAIFPEAAVKVDA	1
4	CAPSID_4073	name_4073	MGELTDNGVQLAKAQ...TCTNPAAHAKIRDLK	1

The sampling can be performed randomly by setting random=True:

df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2, random=True)
aa.display_df(df=df_seq)

	entry	gene	sequence	label
1	CAPSID_305	name_305	MKILVVEDEFDLNRS...SLIKTKRGLGYVIPK	0
2	CAPSID_1928	name_1928	MKIGLIDTHLARQQA...QGFCLERVIDAEATH	0
3	CAPSID_7428	name_7428	MALDPSEAAGIPDEL...WDLKRRPKREQLGAR	1
4	CAPSID_5504	name_5504	MTYPRRRYRRRRHRP...QFREFNLKDPPLNPK	1
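
The difference between index-based and random selection can be sketched on toy data. This is a minimal illustration of the behavior described for n and random, not the library's implementation; `sample_per_class`, its `randomize` and `seed` arguments, and the toy rows are all assumptions made for the example.

```python
import random


def sample_per_class(rows, n, randomize=False, seed=None):
    """Pick up to n (entry, label) rows per label, by index or at random."""
    rng = random.Random(seed)
    groups = {}
    for entry, label in rows:
        groups.setdefault(label, []).append((entry, label))
    selected = []
    for label in sorted(groups):
        group = groups[label]
        k = min(n, len(group))  # never request more rows than a class has
        selected.extend(rng.sample(group, k) if randomize else group[:k])
    return selected


# Toy rows: three negatives (0) followed by three positives (1).
rows = [("P1", 0), ("P2", 0), ("P3", 0), ("P4", 1), ("P5", 1), ("P6", 1)]
print(sample_per_class(rows, n=2))  # first two rows of each class
print(sample_per_class(rows, n=2, randomize=True, seed=0))  # random two per class
```

In both modes the result holds n entries per class, which mirrors the four-row outputs shown above for n=2.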

By default, sequences containing non-canonical amino acids are removed. To keep them unchanged, set non_canonical_aa='keep'; to replace the non-canonical residues with the gap symbol, set non_canonical_aa='gap':

n_unfiltered = len(aa.load_dataset(name='SEQ_DISULFIDE', non_canonical_aa="keep"))
n = len(aa.load_dataset(name='SEQ_DISULFIDE'))
print(f"'SEQ_DISULFIDE' contains {n_unfiltered} proteins and {n} after filtering.")

'SEQ_DISULFIDE' contains 2547 proteins and 2202 after filtering.
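
The three non_canonical_aa options can be sketched on toy sequences. This is a simplified illustration under stated assumptions, not the library's code: `handle_non_canonical` is a hypothetical helper, and "canonical" is taken here to mean the 20 standard amino acid letters.

```python
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def handle_non_canonical(sequences, mode="remove"):
    """Toy version of the 'remove', 'keep', and 'gap' options."""
    if mode == "keep":
        return list(sequences)
    if mode == "remove":
        # Drop every sequence that contains at least one other letter.
        return [seq for seq in sequences if set(seq) <= CANONICAL_AA]
    if mode == "gap":
        # Keep every sequence but mask the other letters with '-'.
        return [
            "".join(aa if aa in CANONICAL_AA else "-" for aa in seq)
            for seq in sequences
        ]
    raise ValueError(f"Unknown mode: {mode!r}")


toy = ["MKTAY", "MXKUA", "ACDEF"]  # 'X' and 'U' are non-canonical here
for mode in ("remove", "keep", "gap"):
    print(mode, handle_non_canonical(toy, mode=mode))
```

Only 'remove' changes the number of entries, which is why it is the only option that appears in the verbose report.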

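Datasets can be filtered by minimum and maximum sequence length using min_len and max_len. The default non_canonical_aa='remove' still applies, so the count below also excludes sequences containing non-canonical amino acids: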

n_len_filtered = len(aa.load_dataset(name='SEQ_DISULFIDE', min_len=100, max_len=200))
print(f"'SEQ_DISULFIDE' contains {n_unfiltered} proteins, of which {n_len_filtered} have a length between 100 and 200 residues.")

'SEQ_DISULFIDE' contains 2547 proteins, of which 644 have a length between 100 and 200 residues.

Set verbose=True to report how many entries each removal step (min_len, max_len, and non_canonical_aa='remove') drops. This only reports counts and does not change the returned data:

df_seq = aa.load_dataset(name="SEQ_DISULFIDE", min_len=100, max_len=200, verbose=True)

'SEQ_DISULFIDE': removed 550 sequence(s) shorter than 'min_len' (100).
'SEQ_DISULFIDE': removed 1273 sequence(s) longer than 'max_len' (200).
'SEQ_DISULFIDE': removed 80 sequence(s) containing non-canonical amino acids.
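
The reported counts account for the full difference between the unfiltered and the returned dataset: 2547 - 550 - 1273 - 80 = 644. The sketch below reproduces this kind of per-step bookkeeping on toy data. It is not the library's implementation: `filter_with_report` is a hypothetical helper, the step order is assumed from the order of the messages above, and the bounds are assumed to be inclusive because the messages say "shorter than" and "longer than".

```python
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def filter_with_report(sequences, min_len=None, max_len=None):
    """Apply the three removal steps in turn and count what each one drops."""
    kept = list(sequences)
    report = {}
    if min_len is not None:
        n_before = len(kept)
        kept = [seq for seq in kept if len(seq) >= min_len]
        report["min_len"] = n_before - len(kept)
    if max_len is not None:
        n_before = len(kept)
        kept = [seq for seq in kept if len(seq) <= max_len]
        report["max_len"] = n_before - len(kept)
    n_before = len(kept)
    kept = [seq for seq in kept if set(seq) <= CANONICAL_AA]
    report["non_canonical"] = n_before - len(kept)
    return kept, report


toy = ["MK", "MKTAYIAK", "MKXAY", "ACDEFGHIKLMNP", "ACDEF"]
kept, report = filter_with_report(toy, min_len=4, max_len=8)
print(kept)
print(report)

# The counts reported for 'SEQ_DISULFIDE' add up to the returned size.
print(2547 - 550 - 1273 - 80)
```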

For the ‘Amino acid level’ datasets, the size of the amino acid window can be adjusted using the aa_window_size parameter:

df_aa = aa.load_dataset(name="AA_CASPASE3", n=2, aa_window_size=5)
aa.display_df(df=df_aa)

	entry	sequence	label
1	CASPASE3_1_pos2	MSLFD	0
2	CASPASE3_1_pos3	SLFDL	0
3	CASPASE3_1_pos126	LRDSM	1
4	CASPASE3_1_pos127	RDSML	1
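
How such windows arise can be sketched with a plain sliding window. This is an illustration under assumptions rather than the library's code: `sliding_windows` is a hypothetical helper, it handles only odd window sizes, it skips positions near the termini where no full window fits, and it indexes each window by the 0-based position of its central residue. That indexing is consistent with the first two rows above but is not confirmed by this page.

```python
def sliding_windows(sequence, window_size=9):
    """Return (center_index, window) pairs for every full odd-sized window."""
    if window_size % 2 == 0:
        raise ValueError("This sketch only handles odd window sizes.")
    half = window_size // 2
    return [
        (center, sequence[center - half:center + half + 1])
        for center in range(half, len(sequence) - half)
    ]


# Toy sequence built from the first residues visible in the rows above.
for center, window in sliding_windows("MSLFDL", window_size=5):
    print(f"pos{center}", window)
```

Each window would then carry the label of the residue it represents, which is how a single protein yields many labeled residue-level entries.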

For Positive-Unlabeled (PU) learning, datasets containing only positive (label ‘1’) and unlabeled (label ‘2’) entries are provided. They are marked by a ‘PU’ suffix in the dataset name:

df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=3)
aa.display_df(df=df_seq)

	entry	gene	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	P05067	APP	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
2	P14925	Pam	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
3	P70180	Npr3	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	477	499	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER
4	P12821	ACE	MGAASGRRGPGLLLP...SHGPQFGSEVELRHS	2	1257	1276	GLDLDAQQAR	VGQWLLLFLGIALLVATLGL	SQRLFSIRHR
5	P36896	ACVR1B	MAESAGASSFFPLVV...KKTLSQLSVQEDVKI	2	127	149	EHPSMWGPVE	LVGIIAGPVFLLFLIIIIVFLVI	NYHQRVYHNR
6	Q8NER5	ACVR1C	MTRALCSALRQALLL...KKTISQLCVKEDCKA	2	114	136	PNAPKLGPME	LAIIITVPVCLLSIAAMLTVWAC	QGRQCSYRKK