aaanalysis.dPULearn
- class aaanalysis.dPULearn(model_kwargs=None, verbose=True, random_state=None)[source]
Bases: object
Deterministic Positive-Unlabeled Learning (dPULearn) class for identifying reliable negatives from unlabeled data [Breimann25a].
dPULearn offers a deterministic approach to Positive-Unlabeled (PU) learning, featuring two distinct identification approaches:
PCA-based identification: This is the primary method where Principal Component Analysis (PCA) is utilized to reduce the dimensionality of the feature space. Based on the most informative principal components (PCs), the model iteratively identifies reliable negatives (labeled by 0) from the set of unlabeled samples (2). These reliable negatives are those that are most distant from the positive samples (1) in the feature space.
Distance-based identification: As a simple alternative, reliable negatives can also be identified using distance metrics such as euclidean, manhattan, or cosine distance.
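The distance-based idea can be illustrated with a minimal, standard-library sketch. It is not the aaanalysis implementation: the helper name `identify_negatives_by_distance` is hypothetical, and measuring Euclidean distance to the mean of the positive samples is an assumption made for illustration.

```python
import math


def identify_negatives_by_distance(X, labels, n_unl_to_neg):
    """Relabel the n_unl_to_neg unlabeled samples (2) that lie farthest
    from the mean of the positives (1) as reliable negatives (0).

    Hypothetical helper for illustration only; the reference point
    (mean of positives) is an assumption, not the library's documented rule.
    """
    positives = [x for x, lab in zip(X, labels) if lab == 1]
    n_features = len(X[0])
    # Feature-wise mean of the positive samples
    pos_mean = [sum(p[j] for p in positives) / len(positives)
                for j in range(n_features)]
    # Euclidean distance of every unlabeled sample to that mean
    dists = [(math.dist(x, pos_mean), i)
             for i, (x, lab) in enumerate(zip(X, labels)) if lab == 2]
    # Largest distance first; ties are broken by sample index
    dists.sort(key=lambda t: (-t[0], t[1]))
    new_labels = list(labels)
    for _, i in dists[:n_unl_to_neg]:
        new_labels[i] = 0
    return new_labels


X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [0.3, 0.2], [4.0, 6.0]]
labels = [1, 1, 2, 2, 2, 2]
new_labels = identify_negatives_by_distance(X, labels, n_unl_to_neg=2)
print(new_labels)  # [1, 1, 2, 0, 2, 0]
```

The two unlabeled samples far from the positives are relabeled 0, while the unlabeled samples close to the positives keep label 2. The procedure has no random component, which is the sense in which such a selection is deterministic.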
Added in version 0.1.0.
- labels_
New dataset labels of samples in X, with identified negative samples labeled by 0.
- Type:
array-like, shape (n_samples,)
- df_pu_
A DataFrame with the PCA-transformed features of ‘X’ containing the following groups of columns:
- ‘selection_via’: Column indicating how reliable negatives were identified (giving either the distance metric or the i-th PC based on which the respective sample was selected).
- ‘PCi’: Value columns for the i-th principal component (PC).
- ‘PCi_abs_dif’: Absolute difference columns for each PC, representing the absolute deviation of each sample from the mean of positives.
For distance-based identification, ‘PCi’ columns are replaced with the results for the selected metric.
- Type:
pd.DataFrame, shape (n_samples, pca_features)
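To make the ‘PCi’, ‘PCi_abs_dif’, and ‘selection_via’ columns more concrete, the following standard-library sketch starts from precomputed PC scores (for example, as produced by sklearn.decomposition.PCA) and derives the other two groups of values. The function name `select_negatives_per_pc` is hypothetical, and cycling through the PCs one selection at a time is an assumption; the library may distribute selections across PCs differently.

```python
def select_negatives_per_pc(pc_scores, labels, n_unl_to_neg):
    """Select reliable negatives PC by PC from precomputed PC scores.

    pc_scores: list of rows, one value per principal component.
    labels: 1 = positive, 2 = unlabeled.
    Returns (new_labels, abs_dif, selection_via).
    Hypothetical helper; the round-robin over PCs is an assumption.
    """
    n_pcs = len(pc_scores[0])
    pos_rows = [row for row, lab in zip(pc_scores, labels) if lab == 1]
    # Mean of the positives on each PC
    pos_mean = [sum(r[j] for r in pos_rows) / len(pos_rows)
                for j in range(n_pcs)]
    # 'PCi_abs_dif': absolute deviation of each sample from that mean
    abs_dif = [[abs(row[j] - pos_mean[j]) for j in range(n_pcs)]
               for row in pc_scores]
    new_labels = list(labels)
    selection_via = [None] * len(labels)
    n_unlabeled = sum(1 for lab in labels if lab == 2)
    pc = 0
    for _ in range(min(n_unl_to_neg, n_unlabeled)):
        # Among still-unlabeled samples, take the largest deviation on this PC
        candidates = [i for i, lab in enumerate(new_labels) if lab == 2]
        best = max(candidates, key=lambda i: abs_dif[i][pc])
        new_labels[best] = 0
        selection_via[best] = f"PC{pc + 1}"
        pc = (pc + 1) % n_pcs  # move on to the next PC
    return new_labels, abs_dif, selection_via


pc_scores = [[1.0, 0.0], [3.0, 2.0], [2.5, 1.5],
             [-4.0, 1.0], [2.0, 7.0], [6.0, 0.5]]
labels = [1, 1, 2, 2, 2, 2]
new_labels, abs_dif, selection_via = select_negatives_per_pc(
    pc_scores, labels, n_unl_to_neg=2)
print(new_labels)     # [1, 1, 2, 0, 0, 2]
print(selection_via)  # [None, None, None, 'PC1', 'PC2', None]
```

Here the sample at index 3 deviates most from the positives on the first PC and the sample at index 4 on the second, so each is recorded with the PC that led to its selection.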
Methods
compare_sets_negatives([list_labels, ...]) – Create DataFrame for comparing sets of identified negatives.
eval(X[, list_labels, names_datasets, ...]) – Evaluates the quality of different sets of identified negatives.
fit(X[, labels, n_unl_to_neg, metric, ...]) – Fit the dPULearn model to identify reliable negative samples (labeled by 0) from unlabeled samples (2) based on the distance to positive samples (1).
- __init__(model_kwargs=None, verbose=True, random_state=None)[source]
- Parameters:
model_kwargs (dict, optional) – Additional keyword arguments for Principal Component Analysis (PCA) model.
verbose (bool, default=True) – If True, verbose outputs are enabled.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, stochastic processes are not seeded, so results can differ between runs.
Notes
All attributes are set during fitting via the dPULearn.fit() method and can be directly accessed.
For a detailed discussion on Positive-Unlabeled (PU) learning, its challenges, and evaluation strategies, refer to the PU Learning section in the Usage Principles documentation: usage_principles/pu_learning.
See also
dPULearnPlot: the respective plotting class.
sklearn.decomposition.PCA() for details on principal component analysis.