dPULearn.fit

dPULearn.fit(X=None, labels=None, X_pos=None, X_unlabeled=None, label_pos=1, label_unl=2, label_neg=None, n_neg=None, n_unl_to_neg=None, metric=None, n_components=0.8)

Fit the dPULearn model to identify reliable negative samples (labeled by 0) from unlabeled samples (2) based on the distance to positive samples (1).

Only unlabeled samples are candidates for reclassification; any pre-labeled negatives provided via label_neg are kept as negatives and are never re-selected. Specify the count in one of two ways (exactly one): n_neg as the total number of negatives wanted (dPULearn identifies n_neg minus the pre-labeled negatives), or n_unl_to_neg to set the number identified directly from the unlabeled pool.
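
The relation between the two count parameters can be written out as a short sketch. The helper resolve_n_to_identify below is hypothetical (it is not part of aaanalysis) and only mirrors the arithmetic described above; the library's own validation may differ in detail.

```python
def resolve_n_to_identify(n_prelabeled_neg, n_neg=None, n_unl_to_neg=None):
    """Return how many unlabeled samples would be reclassified as negatives.

    Hypothetical helper that mirrors the documented rule, not the library's code.
    """
    # Exactly one of the two count parameters must be given
    if (n_neg is None) == (n_unl_to_neg is None):
        raise ValueError("Provide exactly one of 'n_neg' or 'n_unl_to_neg'.")
    if n_unl_to_neg is not None:
        # Direct control: independent of any pre-labeled negatives
        return n_unl_to_neg
    # n_neg is the total, so the pre-labeled negatives are subtracted
    if n_neg <= n_prelabeled_neg:
        raise ValueError("'n_neg' must exceed the number of pre-labeled negatives.")
    return n_neg - n_prelabeled_neg


# One pre-labeled negative, three negatives wanted in total
print(resolve_n_to_identify(n_prelabeled_neg=1, n_neg=3))         # 2
# Same data, but the number taken from the unlabeled pool is set directly
print(resolve_n_to_identify(n_prelabeled_neg=1, n_unl_to_neg=2))  # 2
```

With no pre-labeled negatives (n_prelabeled_neg=0) both parameters give the same number, which matches the equivalence stated for n_neg and n_unl_to_neg.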

Use the dPULearn.labels_ attribute to retrieve the output labels of samples in X including identified negatives. Output labels always use the package convention (1 = positive, 0 = reliable negative, 2 = remaining unlabeled), regardless of the input markers.

Added in version 0.1.0.

There are two input modes (provide exactly one): pass X + labels (a single feature matrix with per-sample markers), or — for the common positives-vs-unlabeled setup — pass the two matrices X_pos and X_unlabeled separately, which are stacked internally with the package markers. Either way, after fitting dPULearn.mask_neg_ is the boolean mask of reliable negatives (over X_unlabeled in the split mode, over X otherwise).
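
To illustrate how the two input modes relate, the sketch below stacks the two matrices and builds the marker vector by hand. It uses NumPy only. The row order (positives first) and the example mask are assumptions made for illustration; the library's internal code may differ.

```python
import numpy as np

X_pos = np.array([[0.2, 0.1], [0.25, 0.2]])
X_unlabeled = np.array([[0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])

# Stack positives on top of the unlabeled pool (assumed row order)
X = np.vstack([X_pos, X_unlabeled])
# Package markers: 1 = positive, 2 = unlabeled
labels = np.concatenate([np.full(len(X_pos), 1), np.full(len(X_unlabeled), 2)])

print(X.shape)          # (5, 2)
print(labels.tolist())  # [1, 1, 2, 2, 2]

# Under this row order, a mask over the stacked X maps back to X_unlabeled
# by dropping the rows that belong to the positives
mask_over_X = np.array([False, False, False, True, True])  # made-up example mask
mask_over_unlabeled = mask_over_X[len(X_pos):]
print(mask_over_unlabeled.tolist())  # [False, True, True]
```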

Parameters:

X (array-like, shape (n_samples, n_features), optional) – Feature matrix. Rows typically correspond to proteins and columns to features. Provide X + labels, or X_pos + X_unlabeled (exactly one of the two modes).
labels (array-like, shape (n_samples,), optional) – Dataset labels of samples in X. Must contain the positive marker (label_pos) and the unlabeled marker (label_unl); pre-labeled negatives (label_neg) are optional. By default positives are 1 and unlabeled are 2; set label_unl=0 to pass the standard {0, 1} encoding directly (0 = unlabeled, 1 = positive).
X_pos (array-like, shape (n_pos, n_features), optional) – Feature matrix of the positive samples (split-input mode). Provided together with X_unlabeled instead of X + labels; the two are stacked and marked internally (positives label_pos, unlabeled label_unl), so no manual label vector is needed.
X_unlabeled (array-like, shape (n_unl, n_features), optional) – Feature matrix of the unlabeled candidate pool (split-input mode). Must have the same number of features as X_pos. After fitting, dPULearn.mask_neg_ is a boolean mask over its rows marking the identified reliable negatives.
label_pos (int, default=1) – Value marking positive samples in labels. Must be present.
label_unl (int, default=2) – Value marking unlabeled samples in labels (the candidate pool). Must be present. Set label_unl=0 to pass the standard {0, 1} encoding (0 = unlabeled, 1 = positive) without re-encoding.
label_neg (int or None, default=None) – Value marking pre-labeled (already known) negatives in labels. When given, those samples are kept as negatives and never re-selected, and dPULearn only identifies the remaining (n_neg minus pre-labeled) negatives from the unlabeled pool. None means labels contains no pre-labeled negatives. Must differ from label_pos / label_unl.
n_neg (int, optional) – Total number of negatives (0) wanted in the output: any pre-labeled negatives (label_neg) plus the newly identified reliable negatives add up to n_neg. So dPULearn identifies n_neg minus the pre-labeled negatives (with no pre-labeled negatives it identifies exactly n_neg). It must exceed the number of pre-labeled negatives. Provide exactly one of n_neg or n_unl_to_neg.
n_unl_to_neg (int, optional) – Number of reliable negatives to identify directly from the unlabeled pool — direct control over how many unlabeled samples are reclassified, independent of any pre-labeled negatives (final negatives = pre-labeled + n_unl_to_neg). Provide exactly one of n_neg or n_unl_to_neg. With no pre-labeled negatives the two are equivalent.
metric (str or None, optional) –
The distance metric to use. If None, Principal Component Analysis (PCA)-based identification is performed. For distance-based identification one of the following measures can be selected:
- euclidean: Euclidean distance (minimum)
- manhattan: Manhattan distance (minimum)
- cosine: Cosine distance (minimum)
n_components (int or float, default=0.80) –
Number of principal components (a) or the percentage of total variance to be covered (b) when PCA is applied.
- In case (a): it should be an integer >= 1.
- In case (b): it should be a float with 0.0 < n_components < 1.0.

Returns:

The fitted instance of the dPULearn class, allowing direct attribute access.

Return type:

dPULearn

Notes

If a distance metric is specified, dPULearn performs distance-based instead of PCA-based identification.
When selecting a distance metric for distance-based identification, consider the dimensionality of the feature space, determined by the ratio of the number of features (n_features) to the number of samples (n_samples) in X. In a low-dimensional space, there are fewer features than samples (n_features < n_samples), whereas a high-dimensional space has significantly more features than samples (n_features >> n_samples). The choice of metric depends on the specific application, with the following general guidelines:
- euclidean: Effective in low-dimensional spaces or when direct distances are meaningful.
- manhattan: Useful when differences along individual dimensions are important, or in the presence of outliers.
- cosine: Recommended for high-dimensional spaces (e.g., n_features >> n_samples), as it evaluates the direction of feature vectors between data points rather than the magnitude of their differences.
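
The difference between the three measures can be seen on two tiny vectors. The functions below are plain NumPy formulas written for illustration, not the routines dPULearn calls internally, and the vectors are made up.

```python
import numpy as np


def euclidean(a, b):
    # Straight-line distance
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    # Sum of absolute differences along each dimension
    return float(np.sum(np.abs(a - b)))


def cosine(a, b):
    # 1 - cosine similarity: depends on direction only, not on magnitude
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 2.0])
b = 10 * a                      # same direction as a, ten times the magnitude
c = np.array([2.0, 1.0, 2.0])   # same magnitude as a, different direction

print(euclidean(a, b), manhattan(a, b), round(cosine(a, b), 3))            # 27.0 45.0 0.0
print(round(euclidean(a, c), 3), manhattan(a, c), round(cosine(a, c), 3))  # 1.414 2.0 0.111
```

The scaled vector b is far from a under the Euclidean and Manhattan distances but has a cosine distance of zero, whereas c is close in magnitude but not in direction.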

Warning

When setting n_components as a percentage of total variance (i.e., a float between 0.0 and 1.0), caution is needed if the explained variance per principal component (PC) is low. Selecting too many PCs with low explained variance may introduce noise and lead to the selection of outliers rather than true negatives.
To mitigate this, users can alternatively set n_components as an integer (≥1) to explicitly limit the number of PCs used.
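
One way to check whether this applies to a dataset is to look at the explained variance per PC before choosing n_components. The sketch below computes it with a plain NumPy SVD on made-up data. The helper n_pcs_for_fraction is hypothetical, and its rule (smallest number of PCs whose cumulative ratio reaches the target) is an assumption that may differ slightly from the PCA implementation used by the library.

```python
import numpy as np

# Made-up data: one dominant direction and two weak ones
X = np.array([
    [4.0, 0.0, 0.0],
    [-4.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, -1.0],
])

# Explained variance ratio per PC from the singular values of the centered data
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(ratio)


def n_pcs_for_fraction(cumulative, fraction):
    # Smallest number of PCs whose cumulative ratio reaches the target (assumed rule)
    return int(np.searchsorted(cumulative, fraction) + 1)


print(np.round(ratio, 3).tolist())           # [0.889, 0.056, 0.056]
print(n_pcs_for_fraction(cumulative, 0.80))  # 1
print(n_pcs_for_fraction(cumulative, 0.95))  # 3
```

Raising the target from 0.80 to 0.95 pulls in two PCs that each explain under 6% of the variance. In such a case an integer n_components (here 1) keeps the selection on the informative component.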

See also

See scikit-learn for details on the three pairwise distance measures.
See [Hastie09] for a detailed explanation of feature spaces and high-dimensional problems.

Examples

dPULearn().fit() identifies reliable negatives (labeled 0) from unlabeled samples, based on their distance to the positives. There are two ways to provide the input (choose one):

Option 1, X + labels: a single feature matrix with a per-sample label vector (positives 1, unlabeled 2; pre-labeled negatives optional). This is the general, flexible form, since the input encoding is configurable via label_pos / label_unl / label_neg.
Option 2, X_pos + X_unlabeled: the positives and the unlabeled pool passed as two separate matrices (the common positives-vs-unlabeled setup). They are stacked internally, so no manual label vector is needed, and the mined negatives are exposed as the mask_neg_ attribute.

Either way fit returns the fitted model. We start with Option 1, creating a small example dataset with positive (1) and unlabeled (2) samples:

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False

X = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels = np.array([1, 2, 2, 2])

Use dPULearn with default Principal Component Analysis (PCA) to obtain a defined number of reliable negative samples (0) by specifying the n_neg parameter:

dpul = aa.dPULearn()
dpul.fit(X=X, labels=labels, n_neg=1)
df_pu = dpul.df_pu_
labels = dpul.labels_ # Updated labels
aa.display_df(df_pu)

	selection_via	PC1 (100.0%)	PC1 (100.0%)_abs_dif
1	nan	-0.400000	0.000000
2	nan	-0.200000	0.200000
3	nan	0.400000	0.800000
4	PC1	0.800000	1.200000
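
The columns of this table can be related to each other with a few lines of NumPy. The sketch starts from the printed PC1 values and reproduces the _abs_dif column as the absolute difference to the mean PC1 value of the positives, then picks the unlabeled sample with the largest difference. This relation is read off the table above and is an assumption about the selection rule, not taken from the library source.

```python
import numpy as np

pc1 = np.array([-0.4, -0.2, 0.4, 0.8])  # PC1 column from the table above
labels = np.array([1, 2, 2, 2])         # 1 = positive, 2 = unlabeled
n_neg = 1

# Absolute difference to the mean PC1 value of the positive samples
abs_dif = np.abs(pc1 - pc1[labels == 1].mean())

# Only unlabeled samples are candidates; take the n_neg largest differences
candidates = np.flatnonzero(labels == 2)
order = np.argsort(abs_dif[candidates])[::-1]
picked = candidates[order[:n_neg]]

print(np.round(abs_dif, 1).tolist())  # [0.0, 0.2, 0.8, 1.2]
print((picked + 1).tolist())          # [4]  (1-based row index, as in the table)
```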

dPULearn recommends the PU encoding (positives 1, unlabeled 2), but the input encoding is flexible via the label_pos, label_unl, and label_neg markers. For example, set label_unl=0 to pass the standard {0, 1} encoding directly (0 = unlabeled, 1 = positive). The labels are normalized internally, so the result is identical to the PU-encoded call:

X = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels_01 = np.array([1, 0, 0, 0]) # Standard encoding: 1 = positive, 0 = unlabeled

dpul = aa.dPULearn()
dpul.fit(X=X, labels=labels_01, label_unl=0, n_neg=1)
aa.display_df(dpul.df_pu_)

	selection_via	PC1 (100.0%)	PC1 (100.0%)_abs_dif
1	nan	-0.400000	0.000000
2	nan	-0.200000	0.200000
3	nan	0.400000	0.800000
4	PC1	0.800000	1.200000
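
The normalization of input markers can be pictured as a simple mapping onto the package convention. The helper to_package_labels below is hypothetical and written only to show why both calls give the same result; it is not the library's internal function.

```python
import numpy as np


def to_package_labels(labels, label_pos=1, label_unl=2, label_neg=None):
    """Map input markers to 1 (positive), 2 (unlabeled), 0 (negative).

    Hypothetical helper for illustration, not part of aaanalysis.
    """
    if label_neg is not None and label_neg in (label_pos, label_unl):
        raise ValueError("'label_neg' must differ from 'label_pos' and 'label_unl'.")
    mapping = {label_pos: 1, label_unl: 2}
    if label_neg is not None:
        mapping[label_neg] = 0
    values = np.asarray(labels).tolist()
    unknown = set(values) - set(mapping)
    if unknown:
        raise ValueError(f"Unexpected markers in 'labels': {sorted(unknown)}")
    return np.array([mapping[v] for v in values])


# Standard {0, 1} encoding: 0 = unlabeled, 1 = positive
print(to_package_labels([1, 0, 0, 0], label_unl=0).tolist())  # [1, 2, 2, 2]
# PU encoding with a pre-labeled negative marked by 0
print(to_package_labels([1, 0, 2, 2], label_neg=0).tolist())  # [1, 0, 2, 2]
```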

If some negatives are already known, mark them with label_neg. Those samples are kept as negatives and are never re-selected — only unlabeled samples are candidates. n_neg is then the total number of negatives wanted, so dPULearn identifies n_neg minus the pre-labeled negatives:

X = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8], [0.1, 0.15]])
# 1 = positive, 0 = pre-labeled (known) negative, 2 = unlabeled
labels_mixed = np.array([1, 0, 2, 2, 2, 2])

dpul = aa.dPULearn()
# n_neg is the TOTAL number of negatives wanted: 1 pre-labeled + 2 newly identified
dpul.fit(X=X, labels=labels_mixed, label_neg=0, n_neg=3)
print(pd.Series(dpul.labels_).value_counts())  # 1 positive, 3 negatives (0), 2 unlabeled (2)
aa.display_df(dpul.df_pu_)

0    3
2    2
1    1
Name: count, dtype: int64

	selection_via	PC1 (100.0%)	PC1 (100.0%)_abs_dif
1	nan	-0.308600	0.000000
2	nan	-0.154300	0.154300
3	nan	0.308600	0.617200
4	PC1	0.617200	0.925800
5	PC1	0.617200	0.925800
6	nan	0.154300	0.462900

n_neg is the total number of negatives wanted (pre-labeled plus newly identified). Alternatively, use n_unl_to_neg to control the number identified directly from the unlabeled pool, independent of any pre-labeled negatives (final negatives = pre-labeled + n_unl_to_neg). Provide exactly one of the two.

Option 2: X_pos + X_unlabeled (positives/unlabeled split). For the common positives-vs-unlabeled setup, pass the two matrices to fit separately instead of stacking them and building a 1/2 label vector by hand. After fitting, dPULearn.mask_neg_ is the boolean mask of reliable negatives over the rows of X_unlabeled:

X_pos = np.array([[0.2, 0.1], [0.25, 0.2]])
X_unlabeled = np.array([[0.2, 0.3], [0.5, 0.7], [0.6, 0.8], [0.1, 0.15]])

dpul = aa.dPULearn()
dpul.fit(X_pos=X_pos, X_unlabeled=X_unlabeled, n_neg=2)
mask_neg = dpul.mask_neg_            # boolean mask over X_unlabeled (True = reliable negative)
X_neg = X_unlabeled[mask_neg]        # the mined reliable negatives
print(f"{X_neg.shape[0]} reliable negatives mined from {len(X_unlabeled)} unlabeled samples")

2 reliable negatives mined from 4 unlabeled samples
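
A typical next step is to combine the positives and the mined negatives into a binary training set. The sketch below shows this with NumPy only. The mask values are made up and stand in for dpul.mask_neg_, so the snippet runs without fitting a model.

```python
import numpy as np

X_pos = np.array([[0.2, 0.1], [0.25, 0.2]])
X_unlabeled = np.array([[0.2, 0.3], [0.5, 0.7], [0.6, 0.8], [0.1, 0.15]])
# Stand-in for dpul.mask_neg_ (made-up values for illustration)
mask_neg = np.array([False, True, True, False])

X_neg = X_unlabeled[mask_neg]
# Binary training set: positives (1) followed by reliable negatives (0)
X_train = np.vstack([X_pos, X_neg])
y_train = np.concatenate([np.ones(len(X_pos), dtype=int), np.zeros(len(X_neg), dtype=int)])
# Samples that stay unlabeled and can be scored later
X_rest = X_unlabeled[~mask_neg]

print(X_train.shape, y_train.tolist())  # (4, 2) [1, 1, 0, 0]
print(X_rest.shape)                     # (2, 2)
```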

As a real-world example, you can load our γ-secretase substrate prediction dataset containing substrates (positive samples, 1) and a redundancy-reduced set of single-span type I transmembrane proteins with unknown substrate status (unlabeled samples, 2):

df_seq = aa.load_dataset(name="DOM_GSEC_PU")
labels = df_seq["label"].to_numpy()
n_pos = sum([x == 1 for x in labels])   # Get number of positive samples
aa.display_df(df=df_seq.tail(5), show_shape=True, n_cols=5)

DataFrame shape: (5, 8)

	entry	sequence	label	tmd_start	tmd_stop
690	P60852	MAGGSATTWGYPVAL...LSQTWAQKLWESNRQ	2	602	624
691	P20239	MARWQRKASVSSPCG...FICYLYKKRTIRFNH	2	684	703
692	P21754	MELSYRLFICLLLWG...TRRCRTASHPVSASE	2	387	409
693	Q12836	MWLLRCVLLCVSLSL...LAVKKQKSCPDQMCQ	2	506	528
694	Q8TCW7	MEQIWLLLLLTIRVL...PTSLVLNGIRNPVFD	2	374	396

Using the respective features, we can create a feature matrix and obtain ‘reliable’ non-substrates by dPULearn:

df_feat = aa.load_features(name="DOM_GSEC")
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

# Number of positive (1) and unlabeled (2) samples
print(pd.Series(labels).value_counts())
dpul.fit(X=X, labels=labels, n_neg=n_pos)
df_pu = dpul.df_pu_
new_labels = dpul.labels_

# Number of updated labels containing reliable negatives (0)
print(pd.Series(new_labels).value_counts())

# Show only selected entries
df = df_pu[df_pu["selection_via"].str.contains("PC", na=False)]
aa.display_df(df=df, show_shape=True, n_rows=20, n_cols=5)

2    631
1     63
Name: count, dtype: int64
2    568
1     63
0     63
Name: count, dtype: int64
DataFrame shape: (63, 15)

	selection_via	PC1 (56.2%)	PC2 (7.4%)	PC3 (2.9%)	PC4 (2.8%)
81	PC3	0.033600	0.007300	0.098200	-0.007800
82	PC7	0.033400	-0.041100	0.033500	-0.005200
84	PC1	0.021000	-0.047800	0.075200	-0.005400
90	PC4	0.039000	-0.032000	-0.001300	0.110900
95	PC2	0.032000	-0.082100	0.025800	-0.037700
109	PC1	0.026100	-0.058500	0.075700	-0.020900
149	PC1	0.026500	-0.038000	0.019100	0.045500
158	PC1	0.023500	-0.060700	0.054000	0.000900
161	PC1	0.025900	0.031400	0.044900	0.055400
169	PC1	0.026500	-0.009900	0.012500	-0.016700
170	PC1	0.026100	-0.035300	0.058300	0.025800
187	PC1	0.026100	0.018800	0.050600	0.038600
192	PC6	0.040100	-0.002200	0.004300	-0.053600
193	PC1	0.024700	-0.056900	0.051300	-0.035600
195	PC5	0.029900	0.006500	0.035800	0.050200
200	PC1	0.021200	-0.056200	0.005700	0.072600
204	PC1	0.025500	-0.007100	0.062900	-0.052500
223	PC1	0.018800	-0.043600	0.048500	-0.072700
254	PC1	0.021500	-0.012900	0.071500	0.038500
264	PC4	0.040500	0.023100	-0.024700	0.113800

Since dPULearn().fit() returns the fitted model, a list comprehension can be used to collect results for various settings of n_components. If given as a float > 0 and < 1, this parameter represents the percentage of total variance to be retained by principal component analysis (PCA).

list_labels = [dpul.fit(X=X, labels=labels, n_neg=n_pos, n_components=i).labels_ for i in [0.6, 0.7, 0.8, 0.9, 0.95]]
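
Label vectors collected this way can be compared to see which negatives are stable across settings. The sketch below uses short made-up label vectors as stand-ins for list_labels, so it runs on its own; the settings and values are illustrative only.

```python
import numpy as np

# Made-up stand-ins for the label vectors obtained with different n_components
settings = [0.6, 0.8, 0.95]
list_labels_demo = [
    np.array([1, 0, 0, 2, 2, 2]),
    np.array([1, 0, 2, 0, 2, 2]),
    np.array([1, 0, 0, 2, 2, 2]),
]

# Row indices labeled as reliable negatives (0) under each setting
neg_sets = [set(np.flatnonzero(lab == 0).tolist()) for lab in list_labels_demo]
for setting, negs in zip(settings, neg_sets):
    print(setting, sorted(negs))

# Samples identified as negatives under every setting
stable = set.intersection(*neg_sets)
print(sorted(stable))  # [1]
```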

As an alternative to PCA-based identification of negatives, distance-based identification can be performed using the ‘euclidean’, ‘manhattan’, or ‘cosine’ distance metric. The df_pu_ attribute then contains distance-based columns instead of PC values:

df_pu = dpul.fit(X=X, labels=labels, n_neg=n_pos, metric="euclidean").df_pu_
aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=10, show_shape=True)

DataFrame shape: (694, 3)

	selection_via	euclidean_dif	euclidean_abs_dif
84	euclidean	3.480700	3.480700
505	euclidean	3.232700	3.232700
509	euclidean	3.336300	3.336300
526	euclidean	3.389700	3.389700
533	euclidean	3.363900	3.363900
542	euclidean	3.075000	3.075000
546	euclidean	3.162500	3.162500
548	euclidean	3.111900	3.111900
552	euclidean	3.288600	3.288600
553	euclidean	3.620800	3.620800

Using PCA-based identification, df_pu_ provides the values of all principal components (PCs) used, together with a selection_via label indicating the PC on which each negative sample was identified:

df_pu = dpul.fit(X=X, labels=labels, n_neg=n_pos, n_components=0.8).df_pu_
aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=n_pos+1, n_cols=4, show_shape=True)

DataFrame shape: (694, 15)

	selection_via	PC1 (56.2%)	PC2 (7.4%)	PC3 (2.9%)
497	PC1	0.022500	-0.051200	0.013400
615	PC1	0.026100	-0.053300	0.099300
406	PC1	0.025400	-0.030800	0.027200
446	PC1	0.026200	-0.013700	0.054500
455	PC1	0.026600	-0.052100	0.089500
468	PC1	0.025600	-0.068800	0.011800
471	PC1	0.025000	-0.005500	0.083500
668	PC1	0.023200	-0.016900	0.076500
605	PC1	0.025800	-0.054500	0.006700
505	PC1	0.023100	-0.048400	0.033900
509	PC1	0.022800	-0.056300	0.086300
604	PC1	0.024800	-0.078800	0.027600
526	PC1	0.022500	-0.055700	0.038200
600	PC1	0.019600	-0.044200	0.092700
534	PC1	0.026100	-0.032300	-0.019400
542	PC1	0.026400	-0.039100	0.049200
545	PC1	0.026200	0.007200	0.039200
548	PC1	0.025200	-0.056900	0.039300
552	PC1	0.026000	-0.072800	-0.031000
553	PC1	0.020500	-0.077500	0.079700
336	PC1	0.026200	-0.020700	0.032200
329	PC1	0.025800	-0.014900	0.043100
624	PC1	0.026500	0.034500	0.046800
308	PC1	0.025400	-0.030600	0.033100
84	PC1	0.021000	-0.047800	0.075200
649	PC1	0.022800	-0.032400	0.108000
637	PC1	0.022600	-0.057800	0.044500
109	PC1	0.026100	-0.058500	0.075700
149	PC1	0.026500	-0.038000	0.019100
158	PC1	0.023500	-0.060700	0.054000
161	PC1	0.025900	0.031400	0.044900
169	PC1	0.026500	-0.009900	0.012500
569	PC1	0.022100	-0.043600	0.065400
170	PC1	0.026100	-0.035300	0.058300
635	PC1	0.025400	0.040600	0.054600
193	PC1	0.024700	-0.056900	0.051300
634	PC1	0.026000	-0.042200	0.007900
200	PC1	0.021200	-0.056200	0.005700
204	PC1	0.025500	-0.007100	0.062900
223	PC1	0.018800	-0.043600	0.048500
254	PC1	0.021500	-0.012900	0.071500
628	PC1	0.025600	-0.027200	0.051300
300	PC1	0.024900	-0.013500	0.052900
187	PC1	0.026100	0.018800	0.050600
585	PC1	0.022400	-0.022200	0.087800
658	PC2	0.035200	-0.081100	-0.040700
683	PC2	0.028700	-0.103200	0.011900
533	PC2	0.039300	-0.094200	-0.045800
337	PC2	0.035100	-0.102700	-0.021700
322	PC2	0.041300	-0.096300	-0.075700
95	PC2	0.032000	-0.082100	0.025800
524	PC3	0.031600	0.028400	0.106200
632	PC3	0.030100	0.022500	0.090800
81	PC3	0.033600	0.007300	0.098200
264	PC4	0.040500	0.023100	-0.024700
90	PC4	0.039000	-0.032000	-0.001300
591	PC4	0.031300	-0.004000	0.032100
195	PC5	0.029900	0.006500	0.035800
641	PC5	0.043500	0.006500	0.015200
501	PC6	0.042100	-0.018500	-0.050200
192	PC6	0.040100	-0.002200	0.004300
82	PC7	0.033400	-0.041100	0.033500
666	PC7	0.035200	0.075600	-0.011600
1	nan	0.052400	0.039300	-0.066300
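
To summarize such a table, the number of negatives selected per PC can be counted from the selection_via column. The sketch below does this with the standard library on a short made-up list that stands in for that column (NaN marks samples that were not selected).

```python
from collections import Counter

nan = float("nan")
# Made-up stand-in for the values of df_pu["selection_via"]
selection_via = ["PC1", "PC1", nan, "PC2", nan, "PC1"]

# Keep only the string entries, i.e. the samples selected as negatives
counts = Counter(v for v in selection_via if isinstance(v, str))
print(counts.most_common())  # [('PC1', 3), ('PC2', 1)]
```

On the real DataFrame, the pandas call df_pu["selection_via"].value_counts() gives the same kind of summary, since it ignores NaN entries by default.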

Further parameters. dPULearn.fit also accepts label_pos — the marker for the positive class in labels (default 1) — and n_unl_to_neg — the number of reliable negatives to mine directly from the unlabeled pool (an alternative to n_neg; provide exactly one of the two).

X_small = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8], [0.1, 0.15]])
labels_small = np.array([1, 2, 2, 2, 2, 2])  # 1 = positive, 2 = unlabeled

dpul = aa.dPULearn()
dpul.fit(X=X_small, labels=labels_small, label_pos=1, n_unl_to_neg=2)
print(pd.Series(dpul.labels_).value_counts())  # 1 positive, 2 mined negatives (0), rest unlabeled (2)
aa.display_df(dpul.df_pu_, show_shape=True)

2    3
0    2
1    1
Name: count, dtype: int64
DataFrame shape: (6, 3)

	selection_via	PC1 (100.0%)	PC1 (100.0%)_abs_dif
1	nan	-0.308600	0.000000
2	nan	-0.154300	0.154300
3	nan	0.308600	0.617200
4	PC1	0.617200	0.925800
5	PC1	0.617200	0.925800
6	nan	0.154300	0.462900