A Novel Classification Method: Neighborhood-Based Positive Unlabeled Learning Using Decision Tree (NPULUD).

Author information

Ghasemkhani Bita, Balbal Kadriye Filiz, Birant Kokten Ulas, Birant Derya

Affiliations

Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390, Turkey.

Department of Computer Science, Dokuz Eylul University, Izmir 35390, Turkey.

Publication information

Entropy (Basel). 2024 May 4;26(5):403. doi: 10.3390/e26050403.

DOI:10.3390/e26050403

PMID:38785652

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11120015/

Abstract

In a standard binary supervised classification task, both negative and positive samples must be present in the training dataset to construct a classification model. However, this condition is not met in certain applications where samples of only one class are obtainable. To overcome this problem, a different classification method, one that learns from positive and unlabeled (PU) data, must be used. In this study, a novel method is presented: neighborhood-based positive unlabeled learning using decision tree (NPULUD). NPULUD first uses the nearest neighborhood approach for the PU strategy and then employs a decision tree algorithm, which utilizes the entropy measure, for the classification task. Entropy plays a pivotal role in assessing the level of uncertainty in the training dataset as the decision tree is built for classification. We validated our method experimentally on 24 real-world datasets. The proposed method attained an average accuracy of 87.24%, while the traditional supervised learning approach obtained an average accuracy of 83.99% on the same datasets. Our method also obtained a statistically notable enhancement (7.74% on average) with respect to state-of-the-art peers.
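The two ingredients named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not say how the neighborhood step assigns labels, so the function `pseudo_label_by_neighborhood`, its parameters `k` and `radius`, and the rule "an unlabeled sample close to its nearest positives is treated as positive, otherwise as negative" are all hypothetical choices made here for illustration. The entropy and information-gain functions follow the standard Shannon definitions that entropy-based decision trees use to score a split.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * shannon_entropy(left) \
        + (len(right) / n) * shannon_entropy(right)
    return shannon_entropy(parent) - weighted


def pseudo_label_by_neighborhood(positives, unlabeled, k=3, radius=1.0):
    """Hypothetical PU step (an assumption, not taken from the paper).

    For each unlabeled point, average the Euclidean distance to its k
    nearest positive samples. Points within `radius` get pseudo-label 1,
    all others get pseudo-label 0.
    """
    pseudo = []
    for u in unlabeled:
        nearest = sorted(math.dist(u, p) for p in positives)[:k]
        mean_dist = sum(nearest) / len(nearest)
        pseudo.append(1 if mean_dist <= radius else 0)
    return pseudo


# Toy data: three labeled positives and two unlabeled points.
positives = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
unlabeled = [(0.5, 0.5), (5.0, 5.0)]

pseudo_labels = pseudo_label_by_neighborhood(positives, unlabeled, k=2, radius=1.0)
train_labels = [1] * len(positives) + pseudo_labels

print(pseudo_labels)                  # [1, 0]
print(shannon_entropy([1, 1, 0, 0]))  # 1.0
print(information_gain([1, 1, 0, 0], [1, 1], [0, 0]))  # 1.0
```

Once the unlabeled samples carry pseudo-labels, the combined set can be passed to any entropy-based decision tree learner, which picks at each node the split with the highest information gain.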
