Kumar Praveen, Lambert Christophe G
Department of Internal Medicine, Division of Translational Informatics, University of New Mexico, Albuquerque, United States.
PeerJ Comput Sci. 2024 Nov 5;10:e2451. doi: 10.7717/peerj-cs.2451. eCollection 2024.
Positive and unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, α, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms vary; some estimate only the proportion, α, of positives in the unlabeled set, while others calculate the probability that each specific unlabeled instance is positive, and some can do both. We propose two PU learning algorithms to estimate α, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR employs a divide-and-conquer approach to cluster SNAR positives into subtypes and estimates α for each subtype by applying PULSCAR to positives from each cluster and all unlabeled. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
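The difference between SCAR and non-SCAR labeling, and the meaning of α, can be illustrated with a small simulation. The sketch below is not the PULSCAR or PULSNAR algorithm; it only simulates the labeling mechanism. The single "severity" feature, the Gaussian parameters, the labeling probabilities, and the names `simulate_pu` and `feature_gap` are all assumptions made for illustration.

```python
import random
from statistics import mean


def simulate_pu(label_prob, n_pos=5000, n_neg=5000, seed=0):
    """Simulate PU labeling on one hypothetical 'severity' feature.

    label_prob maps a positive's feature value to the probability that
    the positive gets labeled. Returns (labeled positives, positives left
    in the unlabeled set, alpha), where alpha is the true proportion of
    positives among all unlabeled examples.
    """
    rng = random.Random(seed)
    # Assumed toy distribution for the positives' feature.
    positives = [rng.gauss(2.0, 1.0) for _ in range(n_pos)]
    labeled, hidden = [], []
    for x in positives:
        if rng.random() < label_prob(x):
            labeled.append(x)
        else:
            hidden.append(x)
    # Negatives are never labeled, so all n_neg of them stay unlabeled.
    alpha = len(hidden) / (len(hidden) + n_neg)
    return labeled, hidden, alpha


def feature_gap(labeled, hidden):
    """Mean feature of labeled positives minus that of unlabeled positives."""
    return mean(labeled) - mean(hidden)


# SCAR: every positive is labeled with the same probability.
scar_labeled, scar_hidden, scar_alpha = simulate_pu(lambda x: 0.3)

# Not SCAR: more severe positives are more likely to be labeled.
snar_labeled, snar_hidden, snar_alpha = simulate_pu(
    lambda x: 0.6 if x > 2.0 else 0.1
)

scar_gap = feature_gap(scar_labeled, scar_hidden)
snar_gap = feature_gap(snar_labeled, snar_hidden)

print(f"SCAR:     alpha={scar_alpha:.3f}, labeled-vs-unlabeled gap={scar_gap:.3f}")
print(f"non-SCAR: alpha={snar_alpha:.3f}, labeled-vs-unlabeled gap={snar_gap:.3f}")
```

Under the SCAR rule the labeled positives resemble the positives hidden in the unlabeled set, so the gap stays near zero. Under the severity-dependent rule the labeled positives are systematically more severe, which is the situation in which an estimator that assumes SCAR can misjudge α.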