aaanalysis.dPULearn.fit

dPULearn.fit(X, labels=None, n_unl_to_neg=None, metric=None, n_components=0.8)

Fit the dPULearn model to identify reliable negative samples (labeled by 0) from unlabeled samples (2) based on the distance to positive samples (1).

Use the dPULearn.labels_ attribute to retrieve the output labels of the samples in X, including the identified negatives.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • labels (array-like, shape (n_samples,)) – Dataset labels of samples in X. Should be either 1 (positive) or 2 (unlabeled).

  • n_unl_to_neg (int, default=1) – Number of negative samples (0) to be reliably identified from unlabeled samples (2). Should be smaller than the number of unlabeled samples.

  • metric (str or None, optional) –

    The distance metric to use. If None, PCA-based identification is performed. For distance-based identification one of the following measures can be selected:

    • euclidean: Euclidean distance (minimum)

    • manhattan: Manhattan distance (minimum)

    • cosine: Cosine distance (minimum)

  • n_components (int or float, default=0.80) –

    Number of principal components (a) or the percentage of total variance to be covered (b) when PCA is applied.

    • In case (a): it should be an integer >= 1.

    • In case (b): it should be a float with 0.0 < n_components < 1.0.

Returns:

The fitted instance of the dPULearn class, allowing direct attribute access.

Return type:

dPULearn

Notes

  • If a distance metric is specified, dPULearn performs distance-based instead of PCA-based identification.

  • When selecting a distance metric for distance-based identification, consider the dimensionality of the feature space, determined by the ratio of the number of features (n_features) to the number of samples (n_samples) in X. In a low-dimensional space, there are fewer features than samples (n_features < n_samples), whereas a high-dimensional space has significantly more features than samples (n_features >> n_samples). The choice of metric depends on the specific application, with the following general guidelines:

    • euclidean: Effective in low-dimensional spaces or when direct distances are meaningful.

    • manhattan: Useful when differences along individual dimensions are important, or in the presence of outliers.

    • cosine: Recommended for high-dimensional spaces (e.g., n_features >> n_samples), as it evaluates the direction of feature vectors between data points rather than the magnitude of their differences.
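
The difference between magnitude-sensitive and direction-based metrics can be illustrated with a minimal NumPy sketch. The helper functions and the vectors a, b, and c are hypothetical and only illustrate the guidelines above; they are not part of aaanalysis and do not reproduce how dPULearn computes distances internally.

```python
import numpy as np

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

def manhattan(u, v):
    """Sum of absolute differences along each dimension."""
    return float(np.abs(u - v).sum())

def cosine(u, v):
    """Cosine distance: 1 - cosine similarity (depends on direction only)."""
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction as a, ten times the magnitude
c = np.array([3.0, 2.0, 1.0])   # same magnitude as a, different direction

for name, func in [("euclidean", euclidean), ("manhattan", manhattan), ("cosine", cosine)]:
    print(f"{name}: d(a, b) = {func(a, b):.4f}, d(a, c) = {func(a, c):.4f}")
```

Because b points in the same direction as a, its cosine distance to a is numerically zero, whereas its Euclidean and Manhattan distances are larger than those of c.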

Warning

  • When setting n_components as a percentage of total variance (i.e., a float between 0.0 and 1.0), caution is needed if the explained variance per principal component (PC) is low. Selecting too many PCs with low explained variance may introduce noise and lead to the selection of outliers rather than true negatives.

  • To mitigate this, users can alternatively set n_components as an integer (≥1) to explicitly limit the number of PCs used.
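
Before choosing between a float and an integer for n_components, it can help to inspect how much variance each PC explains. The following is a minimal NumPy sketch under stated assumptions: the helper names explained_variance_ratio and n_pcs_for_threshold and the random example matrix are illustrative, and the code is not taken from aaanalysis.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each PC, in descending order."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional to the variance per PC
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def n_pcs_for_threshold(ratios, threshold):
    """Smallest number of PCs whose cumulative explained variance reaches the threshold."""
    cumulative = np.cumsum(ratios)
    n_pcs = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_pcs, len(ratios))

# Hypothetical data: one dominant feature plus many low-variance noise features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20)) * np.array([10.0] + [1.0] * 19)

ratios = explained_variance_ratio(X_demo)
print("Explained variance of the first five PCs:", ratios[:5].round(3))
for threshold in [0.8, 0.9, 0.95]:
    print(f"threshold={threshold}: {n_pcs_for_threshold(ratios, threshold)} PC(s)")
```

If a small increase of the threshold adds many PCs that each explain very little variance, an integer n_components may be the safer choice.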

See also

  • See scikit-learn for details on the three different pairwise distance measures.

  • See [Hastie09] for a detailed explanation of feature spaces and high-dimensional problems.

Examples

To demonstrate the dPULearn().fit() method, we create a small example dataset containing positive (1) and unlabeled (2) data samples:

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False

X = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels = np.array([1, 2, 2, 2])

Use dPULearn with the default Principal Component Analysis (PCA) to obtain a defined number of reliable negative samples (0) by specifying the n_unl_to_neg parameter:

dpul = aa.dPULearn()
dpul.fit(X=X, labels=labels, n_unl_to_neg=1)
df_pu = dpul.df_pu_
labels = dpul.labels_ # Updated labels
aa.display_df(df_pu)
  selection_via PC1 (100.0%) PC1 (100.0%)_abs_dif
1 None -0.400000 0.000000
2 None -0.200000 0.200000
3 None 0.400000 0.800000
4 PC1 0.800000 1.200000
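
The values in the table are consistent with the following reading: the _abs_dif column holds the absolute difference between each sample's PC1 value and the PC1 value of the positive sample, and the unlabeled sample with the largest difference is selected as negative. The NumPy sketch below reproduces that reading from the tabulated values. It is an interpretation of the output shown above, not the library's implementation, and all variable names are illustrative.

```python
import numpy as np

pc1 = np.array([-0.4, -0.2, 0.4, 0.8])   # PC1 values copied from the table above
toy_labels = np.array([1, 2, 2, 2])      # 1 = positive, 2 = unlabeled
n_unl_to_neg = 1

# Absolute difference to the mean PC1 value of the positive samples
abs_dif = np.abs(pc1 - pc1[toy_labels == 1].mean())

# Pick the unlabeled samples with the largest absolute difference
unl_idx = np.flatnonzero(toy_labels == 2)
selected = unl_idx[np.argsort(-abs_dif[unl_idx])[:n_unl_to_neg]]

new_toy_labels = toy_labels.copy()
new_toy_labels[selected] = 0             # 0 = identified negative
print(abs_dif)
print(new_toy_labels)
```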

As a real-world example, you can load our γ-secretase substrate prediction dataset containing substrates (positive samples, 1) and a redundancy-reduced set of single-span type I transmembrane proteins with unknown substrate status (unlabeled samples, 2):

df_seq = aa.load_dataset(name="DOM_GSEC_PU")
labels = df_seq["label"].to_numpy()
n_pos = sum([x == 1 for x in labels])   # Get number of positive samples
aa.display_df(df=df_seq.tail(5), show_shape=True, n_cols=5)
DataFrame shape: (5, 8)
  entry sequence label tmd_start tmd_stop
690 P60852 MAGGSATTWGYPVAL...LSQTWAQKLWESNRQ 2 602 624
691 P20239 MARWQRKASVSSPCG...FICYLYKKRTIRFNH 2 684 703
692 P21754 MELSYRLFICLLLWG...TRRCRTASHPVSASE 2 387 409
693 Q12836 MWLLRCVLLCVSLSL...LAVKKQKSCPDQMCQ 2 506 528
694 Q8TCW7 MEQIWLLLLLTIRVL...PTSLVLNGIRNPVFD 2 374 396

Using the respective features, we can create a feature matrix and identify ‘reliable’ non-substrates with dPULearn:

df_feat = aa.load_features(name="DOM_GSEC")
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

# Number of positive (1) and unlabeled (2) samples
print(pd.Series(labels).value_counts())
dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos)
df_pu = dpul.df_pu_
new_labels = dpul.labels_

# Number of updated labels containing reliable negatives (0)
print(pd.Series(new_labels).value_counts())

# Show only selected entries
df = df_pu[df_pu["selection_via"].str.contains("PC", na=False)]
aa.display_df(df=df, show_shape=True, n_rows=20, n_cols=5)
2    631
1     63
Name: count, dtype: int64
2    568
1     63
0     63
Name: count, dtype: int64
DataFrame shape: (63, 15)
  selection_via PC1 (56.2%) PC2 (7.4%) PC3 (2.9%) PC4 (2.8%)
81 PC3 0.033600 0.007300 0.098200 -0.007800
82 PC7 0.033400 -0.041100 0.033500 -0.005200
84 PC1 0.021000 -0.047800 0.075200 -0.005400
90 PC4 0.039000 -0.032000 -0.001300 0.110900
95 PC2 0.032000 -0.082100 0.025800 -0.037700
109 PC1 0.026100 -0.058500 0.075700 -0.020900
149 PC1 0.026500 -0.038000 0.019100 0.045500
158 PC1 0.023500 -0.060700 0.054000 0.000900
161 PC1 0.025900 0.031400 0.044900 0.055400
169 PC1 0.026500 -0.009900 0.012500 -0.016700
170 PC1 0.026100 -0.035300 0.058300 0.025800
187 PC1 0.026100 0.018800 0.050600 0.038600
192 PC6 0.040100 -0.002200 0.004300 -0.053600
193 PC1 0.024700 -0.056900 0.051300 -0.035600
195 PC5 0.029900 0.006500 0.035800 0.050200
200 PC1 0.021200 -0.056200 0.005700 0.072600
204 PC1 0.025500 -0.007100 0.062900 -0.052500
223 PC1 0.018800 -0.043600 0.048500 -0.072700
254 PC1 0.021500 -0.012900 0.071500 0.038500
264 PC4 0.040500 0.023100 -0.024700 0.113800

Since dPULearn().fit() returns the fitted model, a list comprehension can be used to obtain results for various settings of n_components. If given as a float > 0 and < 1, this parameter represents the percentage of total variance to be retained by principal component analysis (PCA).

list_labels = [dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, n_components=i).labels_ for i in [0.6, 0.7, 0.8, 0.9, 0.95]]
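
To see how strongly the identified negatives depend on n_components, the label arrays in a list such as list_labels can be compared pairwise. The sketch below uses a hypothetical helper, negatives_overlap, and small hand-made label arrays instead of real dPULearn output. It shows only one possible way to quantify the agreement, namely the Jaccard index of the samples labeled 0.

```python
import numpy as np

def negatives_overlap(labels_a, labels_b):
    """Jaccard index of the samples labeled as negatives (0) in two label arrays."""
    neg_a = set(np.flatnonzero(np.asarray(labels_a) == 0))
    neg_b = set(np.flatnonzero(np.asarray(labels_b) == 0))
    union = neg_a | neg_b
    # Two arrays without any negatives are treated as fully agreeing
    return len(neg_a & neg_b) / len(union) if union else 1.0

# Hand-made stand-ins for two entries of list_labels (not real results)
toy_labels_a = np.array([1, 0, 0, 2, 2])
toy_labels_b = np.array([1, 0, 2, 0, 2])
print(negatives_overlap(toy_labels_a, toy_labels_b))
```

A value of 1.0 means that both settings identified exactly the same negatives; values near 0.0 indicate that the selection is sensitive to the chosen setting.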

As an alternative to PCA-based identification of negatives, distance-based identification can be performed using the ‘euclidean’, ‘manhattan’, or ‘cosine’ distance metric. The results are again available as a DataFrame via the df_pu_ attribute:

df_pu = dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, metric="euclidean").df_pu_
aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=10, show_shape=True)
DataFrame shape: (694, 3)
  selection_via euclidean_dif euclidean_abs_dif
84 euclidean 3.480700 3.480700
505 euclidean 3.232700 3.232700
509 euclidean 3.336300 3.336300
526 euclidean 3.389700 3.389700
533 euclidean 3.363900 3.363900
542 euclidean 3.075000 3.075000
546 euclidean 3.162500 3.162500
548 euclidean 3.111900 3.111900
552 euclidean 3.288600 3.288600
553 euclidean 3.620800 3.620800

Using PCA-based identification, ‘df_pu’ provides the principal component (PC) values for all PCs used and a label indicating the PC on which each negative sample was identified:

df_pu = dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, n_components=0.8).df_pu_
aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=n_pos+1, n_cols=4, show_shape=True)
DataFrame shape: (694, 15)
  selection_via PC1 (56.2%) PC2 (7.4%) PC3 (2.9%)
497 PC1 0.022500 -0.051200 0.013400
615 PC1 0.026100 -0.053300 0.099300
406 PC1 0.025400 -0.030800 0.027200
446 PC1 0.026200 -0.013700 0.054500
455 PC1 0.026600 -0.052100 0.089500
468 PC1 0.025600 -0.068800 0.011800
471 PC1 0.025000 -0.005500 0.083500
668 PC1 0.023200 -0.016900 0.076500
605 PC1 0.025800 -0.054500 0.006700
505 PC1 0.023100 -0.048400 0.033900
509 PC1 0.022800 -0.056300 0.086300
604 PC1 0.024800 -0.078800 0.027600
526 PC1 0.022500 -0.055700 0.038200
600 PC1 0.019600 -0.044200 0.092700
534 PC1 0.026100 -0.032300 -0.019400
542 PC1 0.026400 -0.039100 0.049200
545 PC1 0.026200 0.007200 0.039200
548 PC1 0.025200 -0.056900 0.039300
552 PC1 0.026000 -0.072800 -0.031000
553 PC1 0.020500 -0.077500 0.079700
336 PC1 0.026200 -0.020700 0.032200
329 PC1 0.025800 -0.014900 0.043100
624 PC1 0.026500 0.034500 0.046800
308 PC1 0.025400 -0.030600 0.033100
84 PC1 0.021000 -0.047800 0.075200
649 PC1 0.022800 -0.032400 0.108000
637 PC1 0.022600 -0.057800 0.044500
109 PC1 0.026100 -0.058500 0.075700
149 PC1 0.026500 -0.038000 0.019100
158 PC1 0.023500 -0.060700 0.054000
161 PC1 0.025900 0.031400 0.044900
169 PC1 0.026500 -0.009900 0.012500
569 PC1 0.022100 -0.043600 0.065400
170 PC1 0.026100 -0.035300 0.058300
635 PC1 0.025400 0.040600 0.054600
193 PC1 0.024700 -0.056900 0.051300
634 PC1 0.026000 -0.042200 0.007900
200 PC1 0.021200 -0.056200 0.005700
204 PC1 0.025500 -0.007100 0.062900
223 PC1 0.018800 -0.043600 0.048500
254 PC1 0.021500 -0.012900 0.071500
628 PC1 0.025600 -0.027200 0.051300
300 PC1 0.024900 -0.013500 0.052900
187 PC1 0.026100 0.018800 0.050600
585 PC1 0.022400 -0.022200 0.087800
658 PC2 0.035200 -0.081100 -0.040700
683 PC2 0.028700 -0.103200 0.011900
533 PC2 0.039300 -0.094200 -0.045800
337 PC2 0.035100 -0.102700 -0.021700
322 PC2 0.041300 -0.096300 -0.075700
95 PC2 0.032000 -0.082100 0.025800
524 PC3 0.031600 0.028400 0.106200
632 PC3 0.030100 0.022500 0.090800
81 PC3 0.033600 0.007300 0.098200
264 PC4 0.040500 0.023100 -0.024700
90 PC4 0.039000 -0.032000 -0.001300
591 PC4 0.031300 -0.004000 0.032100
195 PC5 0.029900 0.006500 0.035800
641 PC5 0.043500 0.006500 0.015200
501 PC6 0.042100 -0.018500 -0.050200
192 PC6 0.040100 -0.002200 0.004300
82 PC7 0.033400 -0.041100 0.033500
666 PC7 0.035200 0.075600 -0.011600
1 None 0.052400 0.039300 -0.066300
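
To summarize how many negatives were identified via each PC, the selection_via column can be counted. The sketch below builds a small stand-in DataFrame from a few rows of the table above rather than calling dPULearn. Storing non-selected samples as None is an assumption about how the missing entries are represented.

```python
import pandas as pd

# Stand-in for dpul.df_pu_ with a few rows copied from the table above
df_toy = pd.DataFrame(
    {"selection_via": ["PC1", "PC1", "PC2", "PC3", None]},
    index=[497, 615, 658, 524, 1],
)

# value_counts() ignores missing entries by default, so only selected samples are counted
counts = df_toy["selection_via"].value_counts()
print(counts)
```

With real output, the same call could be applied to the selection_via column of df_pu.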