Gu Bin, Zhai Zhou, Deng Cheng, Huang Heng
IEEE Trans Neural Netw Learn Syst. 2021 Sep;32(9):4111-4122. doi: 10.1109/TNNLS.2020.3016928. Epub 2021 Aug 31.
Active learning is an important learning paradigm in machine learning and data mining, which aims to train effective classifiers with as few labeled samples as possible. Querying discriminative (informative) and representative samples is the state-of-the-art approach for active learning. Fully utilizing a large amount of unlabeled data provides a second chance to improve the performance of active learning. Although several active learning methods have been proposed that combine active learning with semisupervised learning, fast active learning that fully exploits unlabeled data while querying discriminative and representative samples is still an open question. To overcome this challenging issue, in this article, we propose a new efficient batch mode active learning algorithm. Specifically, we first provide an active learning risk bound that fully accounts for the unlabeled samples in characterizing informativeness and representativeness. Based on the risk bound, we derive a new objective function for batch mode active learning. After that, we propose a wrapper algorithm to solve the objective function, which essentially trains a semisupervised classifier and selects discriminative and representative samples alternately. In particular, to avoid retraining the semisupervised classifier from scratch after each query, we design two unique procedures based on the path-following technique, which can efficiently remove multiple queried samples from the unlabeled data set and add them to the labeled data set. Extensive experimental results on a variety of benchmark data sets not only show that our algorithm has better generalization performance than the state-of-the-art active learning approaches but also demonstrate its significant efficiency.
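To illustrate the core idea of scoring unlabeled samples by both informativeness and representativeness, here is a minimal sketch in NumPy. This is not the authors' algorithm (which alternates semisupervised training with path-following updates); it is a generic batch-selection heuristic under assumed definitions: informativeness as closeness to the decision boundary (small |f(x)|) and representativeness as average RBF similarity to the unlabeled pool. The function name `select_batch` and the trade-off weight `alpha` are hypothetical.

```python
import numpy as np

def select_batch(X_unlabeled, decision_values, batch_size, alpha=0.5):
    """Score each unlabeled sample by a weighted sum of informativeness
    (closeness to the decision boundary) and representativeness (mean
    similarity to the rest of the unlabeled pool), then return the
    indices of the top-scoring batch."""
    # Informativeness: small |f(x)| means the classifier is uncertain.
    informativeness = -np.abs(decision_values)
    # Representativeness: mean RBF similarity to the unlabeled pool
    # (bandwidth fixed at 1 for simplicity).
    sq_dists = ((X_unlabeled[:, None, :] - X_unlabeled[None, :, :]) ** 2).sum(-1)
    representativeness = np.exp(-sq_dists).mean(axis=1)
    scores = alpha * informativeness + (1 - alpha) * representativeness
    # Greedily take the batch_size highest-scoring samples.
    return np.argsort(scores)[::-1][:batch_size]
```

In a wrapper loop, one would query labels for the returned indices, move those samples from the unlabeled to the labeled set, and retrain the classifier; the paper's contribution is doing that update incrementally rather than from scratch.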