Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China.
Sci Rep. 2018 Aug 29;8(1):13009. doi: 10.1038/s41598-018-31395-5.
Traditional supervised learning classifiers need many labeled samples to achieve good performance; however, in many biological datasets only a small number of samples are labeled and the rest are unlabeled. Labeling these samples manually is difficult or expensive. Techniques such as active learning and semi-supervised learning have been proposed to exploit the unlabeled samples and improve model performance. However, in active learning the model can be short-sighted or biased, and some manual labeling effort is still needed, while semi-supervised learning methods are easily affected by noisy samples. In this paper we propose a novel logistic regression model based on the complementarity of active learning and semi-supervised learning, which uses the unlabeled samples at the least cost to improve disease classification accuracy. In addition, a mechanism for updating pseudo-labeled samples is designed to reduce the number of falsely pseudo-labeled samples. The experimental results show that the new model achieves better performance than widely used semi-supervised learning and active learning methods in disease classification and gene selection.
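As an illustration only, the general idea of combining the two strategies around a logistic regression classifier can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the uncertainty criterion (predicted probability closest to 0.5), the confidence threshold of 0.9, and the rule of rebuilding all pseudo-labels in every round are hypothetical choices that the abstract does not specify. The classifier is a plain, unregularized logistic regression, and nothing related to gene selection is modeled.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # Probability of class 1; the bias is stored as the last weight.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logreg(X, y, epochs=300, lr=0.5):
    """Plain batch gradient descent; returns weights with the bias last."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def active_semi_supervised(X_lab, y_lab, X_unl, oracle, rounds=5,
                           queries_per_round=1, confident=0.9):
    """Hypothetical loop: query uncertain samples, pseudo-label confident ones."""
    X_lab, y_lab = list(X_lab), list(y_lab)
    pool = list(range(len(X_unl)))  # indices that still lack a true label
    pseudo = {}                     # index -> pseudo-label, rebuilt every round
    for _ in range(rounds):
        X_train = X_lab + [X_unl[i] for i in pseudo]
        y_train = y_lab + [pseudo[i] for i in pseudo]
        w = fit_logreg(X_train, y_train)
        probs = {i: predict_proba(w, X_unl[i]) for i in pool}
        # Active learning step: ask the oracle (a human annotator in practice)
        # to label the samples the current model is least certain about.
        uncertain = sorted(pool, key=lambda i: abs(probs[i] - 0.5))
        for i in uncertain[:queries_per_round]:
            X_lab.append(X_unl[i])
            y_lab.append(oracle(i))
            pool.remove(i)
        # Semi-supervised step with update (assumed rule): discard the old
        # pseudo-labels and re-assign them from the current model, so that
        # earlier false pseudo-labels can be revised or dropped.
        pseudo = {i: int(probs[i] >= 0.5) for i in pool
                  if max(probs[i], 1.0 - probs[i]) >= confident}
    X_train = X_lab + [X_unl[i] for i in pseudo]
    y_train = y_lab + [pseudo[i] for i in pseudo]
    return fit_logreg(X_train, y_train), pseudo
```

Here `oracle` stands in for the manual labeling step: it is any callable that returns the true label for an index into `X_unl`. Only the queried samples incur labeling cost, while confidently predicted samples contribute through pseudo-labels.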