aaanalysis.dPULearn.fit
- dPULearn.fit(X, labels=None, n_unl_to_neg=None, metric=None, n_components=0.8)
Fit the dPULearn model to identify reliable negative samples (labeled by 0) from unlabeled samples (2) based on the distance to positive samples (1).
Use the dPULearn.labels_ attribute to retrieve the output labels of samples in X, including identified negatives.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples,)) – Dataset labels of samples in X. Should be either 1 (positive) or 2 (unlabeled).
n_unl_to_neg (int, default=1) – Number of negative samples (0) to be reliably identified from unlabeled samples (2). Should be < n unlabeled samples.
metric (str or None, optional) – The distance metric to use. If None, PCA-based identification is performed. For distance-based identification, one of the following measures can be selected:
euclidean: Euclidean distance (minimum)
manhattan: Manhattan distance (minimum)
cosine: Cosine distance (minimum)
n_components (int or float, default=0.80) –
Number of principal components (a) or the percentage of total variance to be covered (b) when PCA is applied.
In case (a): it should be an integer >= 1.
In case (b): it should be a float with 0.0 < n_components < 1.0.
- Returns:
The fitted instance of the dPULearn class, allowing direct attribute access.
- Return type:
dPULearn
Notes
If a distance metric is specified, dPULearn performs distance-based instead of PCA-based identification.
When selecting a distance metric for distance-based identification, consider the dimensionality of the feature space, determined by the ratio of the number of features (n_features) to the number of samples (n_samples) in X. In a low-dimensional space, there are fewer features than samples (n_features < n_samples), whereas a high-dimensional space has significantly more features than samples (n_features >> n_samples). The choice of metric depends on the specific application, with the following general guidelines:
euclidean: Effective in low-dimensional spaces or when direct distances are meaningful.
manhattan: Useful when differences along individual dimensions are important, or in the presence of outliers.
cosine: Recommended for high-dimensional spaces (e.g., n_features >> n_samples), as it evaluates the direction of feature vectors between data points rather than the magnitude of their differences.
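To make the difference between the three metrics concrete, the following sketch computes them by hand for two small vectors. It uses only the standard library; the function names are illustrative and are not part of aaanalysis or scikit-learn, and the vectors are made-up values.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each dimension
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends on direction only, not on vector length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude
c = [2.0, 1.0]  # same magnitude as a, different direction

for name, other in [("b", b), ("c", c)]:
    print(name, euclidean(a, other), manhattan(a, other), cosine_distance(a, other))
```

For the pair (a, b) the cosine distance is numerically zero although the Euclidean and Manhattan distances are not, which illustrates why cosine distance is described above as evaluating direction rather than magnitude.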
Warning
When setting n_components as a percentage of total variance (i.e., a float between 0.0 and 1.0), caution is needed if the explained variance per principal component (PC) is low. Selecting too many PCs with low explained variance may introduce noise and lead to the selection of outliers rather than true negatives.
To mitigate this, users can alternatively set n_components as an integer (≥1) to explicitly limit the number of PCs used.
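To make this warning concrete, the sketch below counts how many PCs are needed to reach a variance target. The explained variance ratios are hypothetical values chosen for illustration, and the helper name n_pcs_for_variance is an assumption rather than part of aaanalysis. The rule shown (the smallest number of PCs whose cumulative ratio reaches the target) is one common convention and may differ in detail from what dPULearn does internally.

```python
from itertools import accumulate

def n_pcs_for_variance(ratios, target):
    """Return the smallest number of PCs whose cumulative explained variance reaches target."""
    for n_pcs, cum in enumerate(accumulate(ratios), start=1):
        if cum >= target:
            return n_pcs
    return len(ratios)  # Target not reachable with the given PCs

# Hypothetical explained variance ratios, sorted from PC1 downwards
ratios = [0.50, 0.12, 0.07, 0.05, 0.04, 0.03, 0.03, 0.02]

n_float = n_pcs_for_variance(ratios, target=0.8)
low_variance_pcs = [r for r in ratios[:n_float] if r < 0.10]
print(n_float)                # number of PCs kept for an 80% target
print(len(low_variance_pcs))  # how many of them explain less than 10% each
```

With these made-up ratios, an 80% target keeps six PCs, four of which each explain less than 10% of the variance. Passing an integer such as n_components=2 would instead restrict the selection to the dominant PCs.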
See also
See scikit-learn for details on the three different pairwise distance measures.
See [Hastie09] for a detailed explanation on feature space and high-dimensional problems.
Examples
To demonstrate the dPULearn().fit() method, we create a small example dataset containing positive (1) and unlabeled (2) data samples:

    import aaanalysis as aa
    import pandas as pd
    import numpy as np

    aa.options["verbose"] = False
    X = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
    labels = np.array([1, 2, 2, 2])
Use dPULearn with default Principal Component Analysis (PCA) to obtain a defined number of reliable negative samples (0) by specifying the n_unl_to_neg parameter:

    dpul = aa.dPULearn()
    dpul.fit(X=X, labels=labels, n_unl_to_neg=1)
    df_pu = dpul.df_pu_
    labels = dpul.labels_  # Updated labels
    aa.display_df(df_pu)

       selection_via  PC1 (100.0%)  PC1 (100.0%)_abs_dif
    1  None              -0.400000              0.000000
    2  None              -0.200000              0.200000
    3  None               0.400000              0.800000
    4  PC1                0.800000              1.200000

As a real-world example, you can load our γ-secretase substrate prediction dataset containing substrates (positive samples, 1) and a redundancy-reduced set of single-span type I transmembrane proteins with unknown substrate status (unlabeled samples, 2):
    df_seq = aa.load_dataset(name="DOM_GSEC_PU")
    labels = df_seq["label"].to_numpy()
    n_pos = sum([x == 1 for x in labels])  # Get number of positive samples
    aa.display_df(df=df_seq.tail(5), show_shape=True, n_cols=5)

DataFrame shape: (5, 8)

         entry   sequence                           label  tmd_start  tmd_stop
    690  P60852  MAGGSATTWGYPVAL...LSQTWAQKLWESNRQ      2        602       624
    691  P20239  MARWQRKASVSSPCG...FICYLYKKRTIRFNH      2        684       703
    692  P21754  MELSYRLFICLLLWG...TRRCRTASHPVSASE      2        387       409
    693  Q12836  MWLLRCVLLCVSLSL...LAVKKQKSCPDQMCQ      2        506       528
    694  Q8TCW7  MEQIWLLLLLTIRVL...PTSLVLNGIRNPVFD      2        374       396

Using the respective features, we can create a feature matrix and obtain ‘reliable’ non-substrates by dPULearn:
    df_feat = aa.load_features(name="DOM_GSEC")
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
    # Number of positive (1) and unlabeled (2) samples
    print(pd.Series(labels).value_counts())
    dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos)
    df_pu = dpul.df_pu_
    new_labels = dpul.labels_
    # Number of updated labels containing reliable negatives (0)
    print(pd.Series(new_labels).value_counts())
    # Show only selected entries
    df = df_pu[df_pu["selection_via"].str.contains("PC", na=False)]
    aa.display_df(df=df, show_shape=True, n_rows=20, n_cols=5)

    2    631
    1     63
    Name: count, dtype: int64
    2    568
    1     63
    0     63
    Name: count, dtype: int64

DataFrame shape: (63, 15)
         selection_via  PC1 (56.2%)  PC2 (7.4%)  PC3 (2.9%)  PC4 (2.8%)
     81  PC3               0.033600    0.007300    0.098200   -0.007800
     82  PC7               0.033400   -0.041100    0.033500   -0.005200
     84  PC1               0.021000   -0.047800    0.075200   -0.005400
     90  PC4               0.039000   -0.032000   -0.001300    0.110900
     95  PC2               0.032000   -0.082100    0.025800   -0.037700
    109  PC1               0.026100   -0.058500    0.075700   -0.020900
    149  PC1               0.026500   -0.038000    0.019100    0.045500
    158  PC1               0.023500   -0.060700    0.054000    0.000900
    161  PC1               0.025900    0.031400    0.044900    0.055400
    169  PC1               0.026500   -0.009900    0.012500   -0.016700
    170  PC1               0.026100   -0.035300    0.058300    0.025800
    187  PC1               0.026100    0.018800    0.050600    0.038600
    192  PC6               0.040100   -0.002200    0.004300   -0.053600
    193  PC1               0.024700   -0.056900    0.051300   -0.035600
    195  PC5               0.029900    0.006500    0.035800    0.050200
    200  PC1               0.021200   -0.056200    0.005700    0.072600
    204  PC1               0.025500   -0.007100    0.062900   -0.052500
    223  PC1               0.018800   -0.043600    0.048500   -0.072700
    254  PC1               0.021500   -0.012900    0.071500    0.038500
    264  PC4               0.040500    0.023100   -0.024700    0.113800

Since dPULearn().fit() returns the fitted model, a list comprehension can be used to create results for various settings of n_components. If given as a float > 0 and < 1, this parameter represents the percentage of total variance to be retained by principal component analysis (PCA).

    list_labels = [dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, n_components=i).labels_ for i in [0.6, 0.7, 0.8, 0.9, 0.95]]
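The pattern above works because fit() returns the instance itself, so an attribute can be read in the same expression. The sketch below shows the idea with a hypothetical stand-in class (ToyModel, its threshold parameter, and its labeling rule are invented for illustration and have nothing to do with the dPULearn implementation).

```python
class ToyModel:
    """Hypothetical stand-in for an estimator; not the dPULearn implementation."""

    def __init__(self):
        self.labels_ = None

    def fit(self, values, threshold=0.5):
        # Store a fresh result list on the instance ...
        self.labels_ = [0 if v > threshold else 2 for v in values]
        # ... and return the instance so attributes can be chained
        return self

values = [0.2, 0.6, 0.9]
model = ToyModel()

# One refit per setting; each result is read before the next fit overwrites it
list_labels = [model.fit(values, threshold=t).labels_ for t in [0.5, 0.8]]
print(list_labels)
```

Because each call refits the same instance, attributes such as labels_ only reflect the most recent fit; reading them inside the comprehension, as done here, captures one result per setting.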
As an alternative to PCA-based identification of negatives, distance-based identification can be performed using distance metrics including ‘euclidean’, ‘manhattan’, or ‘cosine’ distance. A DataFrame with the results is available via the df_pu_ attribute:

    df_pu = dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, metric="euclidean").df_pu_
    aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=10, show_shape=True)
DataFrame shape: (694, 3)

         selection_via  euclidean_dif  euclidean_abs_dif
     84  euclidean           3.480700           3.480700
    505  euclidean           3.232700           3.232700
    509  euclidean           3.336300           3.336300
    526  euclidean           3.389700           3.389700
    533  euclidean           3.363900           3.363900
    542  euclidean           3.075000           3.075000
    546  euclidean           3.162500           3.162500
    548  euclidean           3.111900           3.111900
    552  euclidean           3.288600           3.288600
    553  euclidean           3.620800           3.620800

Using PCA-based identification, ‘df_pu’ provides the principal component (PC) values for all PCs used and labels each identified negative sample with the PC on which it was selected:

    df_pu = dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, n_components=0.8).df_pu_
    aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=n_pos+1, n_cols=4, show_shape=True)
DataFrame shape: (694, 15)

         selection_via  PC1 (56.2%)  PC2 (7.4%)  PC3 (2.9%)
    497  PC1               0.022500   -0.051200    0.013400
    615  PC1               0.026100   -0.053300    0.099300
    406  PC1               0.025400   -0.030800    0.027200
    446  PC1               0.026200   -0.013700    0.054500
    455  PC1               0.026600   -0.052100    0.089500
    468  PC1               0.025600   -0.068800    0.011800
    471  PC1               0.025000   -0.005500    0.083500
    668  PC1               0.023200   -0.016900    0.076500
    605  PC1               0.025800   -0.054500    0.006700
    505  PC1               0.023100   -0.048400    0.033900
    509  PC1               0.022800   -0.056300    0.086300
    604  PC1               0.024800   -0.078800    0.027600
    526  PC1               0.022500   -0.055700    0.038200
    600  PC1               0.019600   -0.044200    0.092700
    534  PC1               0.026100   -0.032300   -0.019400
    542  PC1               0.026400   -0.039100    0.049200
    545  PC1               0.026200    0.007200    0.039200
    548  PC1               0.025200   -0.056900    0.039300
    552  PC1               0.026000   -0.072800   -0.031000
    553  PC1               0.020500   -0.077500    0.079700
    336  PC1               0.026200   -0.020700    0.032200
    329  PC1               0.025800   -0.014900    0.043100
    624  PC1               0.026500    0.034500    0.046800
    308  PC1               0.025400   -0.030600    0.033100
     84  PC1               0.021000   -0.047800    0.075200
    649  PC1               0.022800   -0.032400    0.108000
    637  PC1               0.022600   -0.057800    0.044500
    109  PC1               0.026100   -0.058500    0.075700
    149  PC1               0.026500   -0.038000    0.019100
    158  PC1               0.023500   -0.060700    0.054000
    161  PC1               0.025900    0.031400    0.044900
    169  PC1               0.026500   -0.009900    0.012500
    569  PC1               0.022100   -0.043600    0.065400
    170  PC1               0.026100   -0.035300    0.058300
    635  PC1               0.025400    0.040600    0.054600
    193  PC1               0.024700   -0.056900    0.051300
    634  PC1               0.026000   -0.042200    0.007900
    200  PC1               0.021200   -0.056200    0.005700
    204  PC1               0.025500   -0.007100    0.062900
    223  PC1               0.018800   -0.043600    0.048500
    254  PC1               0.021500   -0.012900    0.071500
    628  PC1               0.025600   -0.027200    0.051300
    300  PC1               0.024900   -0.013500    0.052900
    187  PC1               0.026100    0.018800    0.050600
    585  PC1               0.022400   -0.022200    0.087800
    658  PC2               0.035200   -0.081100   -0.040700
    683  PC2               0.028700   -0.103200    0.011900
    533  PC2               0.039300   -0.094200   -0.045800
    337  PC2               0.035100   -0.102700   -0.021700
    322  PC2               0.041300   -0.096300   -0.075700
     95  PC2               0.032000   -0.082100    0.025800
    524  PC3               0.031600    0.028400    0.106200
    632  PC3               0.030100    0.022500    0.090800
     81  PC3               0.033600    0.007300    0.098200
    264  PC4               0.040500    0.023100   -0.024700
     90  PC4               0.039000   -0.032000   -0.001300
    591  PC4               0.031300   -0.004000    0.032100
    195  PC5               0.029900    0.006500    0.035800
    641  PC5               0.043500    0.006500    0.015200
    501  PC6               0.042100   -0.018500   -0.050200
    192  PC6               0.040100   -0.002200    0.004300
     82  PC7               0.033400   -0.041100    0.033500
    666  PC7               0.035200    0.075600   -0.011600
      1  None              0.052400    0.039300   -0.066300
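
The four-sample toy example earlier in this section reports a PC1 value and an _abs_dif value per sample. Those numbers are consistent with taking the absolute difference to the mean PC1 value of the positive samples and relabeling the unlabeled sample with the largest difference. The sketch below reproduces that single-PC step under this reading of the output; the function name select_negatives_on_pc is hypothetical, and this is not the library's code (how dPULearn distributes selections across several PCs is not covered here).

```python
from statistics import mean

def select_negatives_on_pc(pc_values, labels, n_unl_to_neg):
    """Relabel the n unlabeled samples (2) farthest from the positive mean as negatives (0)."""
    # Mean PC value of the positive samples (label 1)
    pos_mean = mean(v for v, lab in zip(pc_values, labels) if lab == 1)
    # Absolute difference of every sample to that mean
    abs_dif = [abs(v - pos_mean) for v in pc_values]
    # Rank unlabeled samples by decreasing difference and keep the top n
    unlabeled = [i for i, lab in enumerate(labels) if lab == 2]
    farthest = sorted(unlabeled, key=lambda i: abs_dif[i], reverse=True)[:n_unl_to_neg]
    new_labels = list(labels)
    for i in farthest:
        new_labels[i] = 0
    return new_labels, abs_dif

# PC1 values and labels taken from the toy example output
pc1 = [-0.4, -0.2, 0.4, 0.8]
labels = [1, 2, 2, 2]

new_labels, abs_dif = select_negatives_on_pc(pc1, labels, n_unl_to_neg=1)
print(new_labels)                       # [1, 2, 2, 0]
print([round(d, 1) for d in abs_dif])   # [0.0, 0.2, 0.8, 1.2]
```

The fourth sample is relabeled as a negative, matching the row marked PC1 in the toy output.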