dPULearn

class dPULearn(model_kwargs=None, verbose=True, random_state=None)

Bases: Wrapper

Deterministic Positive-Unlabeled Learning (dPULearn) class for identifying reliable negatives from unlabeled data [Breimann25].

As a Wrapper, it implements the .fit / .eval model contract.

dPULearn offers a deterministic approach to Positive-Unlabeled (PU) learning, featuring two distinct identification approaches:

PCA-based identification: The primary method. Principal Component Analysis (PCA) is used to reduce the dimensionality of the feature space. Based on the most informative principal components (PCs), the model iteratively identifies reliable negatives (labeled by 0) from the set of unlabeled samples (2). These reliable negatives are the unlabeled samples most distant from the positive samples (1) in the feature space.
Distance-based identification: As a simple alternative, reliable negatives can also be identified using distance metrics such as Euclidean, Manhattan, or cosine distance.
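The distance-based idea can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the function name `identify_negatives_by_distance`, the parameter `n_negatives`, and the choice to average the Euclidean distance over all positives are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_negatives_by_distance(X, labels, n_negatives=1):
    """Relabel the unlabeled samples (2) farthest from the positives (1) as 0.

    Hypothetical helper: the distance of an unlabeled sample is taken as its
    average Euclidean distance to all positive samples (an assumption).
    """
    positives = [x for x, lab in zip(X, labels) if lab == 1]
    scored = []
    for i, (x, lab) in enumerate(zip(X, labels)):
        if lab == 2:
            avg_dist = sum(euclidean(x, p) for p in positives) / len(positives)
            scored.append((avg_dist, i))
    # Largest distance first; ties broken by index so the result is deterministic
    scored.sort(key=lambda t: (-t[0], t[1]))
    new_labels = list(labels)
    for _, i in scored[:n_negatives]:
        new_labels[i] = 0
    return new_labels


# Toy data: two positives near the origin, four unlabeled samples
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [0.3, 0.2], [4.0, 6.0]]
labels = [1, 1, 2, 2, 2, 2]
new_labels = identify_negatives_by_distance(X, labels, n_negatives=2)
print(new_labels)
```

Because the selection is a plain sort with an explicit tie-break, the same input always yields the same negatives, which is the property the "deterministic" in dPULearn refers to.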

Added in version 0.1.0.

labels_

New dataset labels of samples in X with identified negative samples labeled by 0.

df_pu_

A DataFrame with the PCA-transformed features of ‘X’ containing the following groups of columns:

‘selection_via’: Column indicating how reliable negatives were identified (either giving the distance metric or the i-th PC based on which the respective sample was selected).
‘PCi’: Value columns for the i-th principal component (PC).
‘PCi_abs_dif’: Absolute difference columns for each PC, representing the absolute deviation of each sample from the mean of positives.

For distance-based identification, ‘PCi’ columns are replaced with the results for the selected metric.
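How a ‘PCi_abs_dif’ column relates to its ‘PCi’ column can be sketched as follows. The example assumes the PC values are already computed (the numbers are made up), and the helper name `abs_dif_from_positives` is hypothetical; it only illustrates the "absolute deviation from the mean of positives" described above, not the library's internal code.

```python
def abs_dif_from_positives(pc_values, labels):
    """Absolute deviation of every sample from the mean of the positives (1)."""
    pos = [v for v, lab in zip(pc_values, labels) if lab == 1]
    mean_pos = sum(pos) / len(pos)
    return [abs(v - mean_pos) for v in pc_values]


# Made-up values of one principal component for five samples
pc1 = [1.0, 3.0, 2.5, -4.0, 6.0]
labels = [1, 1, 2, 2, 2]  # 1 = positive, 2 = unlabeled

pc1_abs_dif = abs_dif_from_positives(pc1, labels)

# Rows laid out like the 'PCi' / 'PCi_abs_dif' column pairs
rows = [{"PC1": v, "PC1_abs_dif": d} for v, d in zip(pc1, pc1_abs_dif)]

# The unlabeled sample with the largest deviation on this PC
unlabeled = [i for i, lab in enumerate(labels) if lab == 2]
most_distant = max(unlabeled, key=lambda i: pc1_abs_dif[i])
print(rows[most_distant], most_distant)
```

In this toy case the mean of the positives on PC1 is 2.0, so the sample at index 3 (value -4.0) has the largest deviation and would be the first candidate for a reliable negative on that component.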

Methods

`compare_sets_negatives`(list_labels[, ...])	Create DataFrame for comparing sets of identified negatives.
`eval`(X, list_labels[, names_datasets, ...])	Evaluates the quality of different sets of identified negatives.
`fit`([X, labels, X_pos, X_unlabeled, ...])	Fit the dPULearn model to identify reliable negative samples (labeled by 0) from unlabeled samples (2) based on the distance to positive samples (1).
`project`(X[, method])	Project new samples into the fitted PCA coordinate space (the `PCi` columns of `df_pu_`).

__init__(model_kwargs=None, verbose=True, random_state=None)

Parameters:

model_kwargs (dict, optional) – Additional keyword arguments for Principal Component Analysis (PCA) model.
verbose (bool, default=True) – If True, verbose outputs are enabled.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, stochastic processes are not seeded and their results may differ between runs.

Notes

All attributes are set during fitting via the dPULearn.fit() method and can be directly accessed.
For a detailed discussion on Positive-Unlabeled (PU) learning, its challenges, and evaluation strategies, refer to the PU Learning section in the Usage Principles documentation: usage_principles/pu_learning.