Yin Chunwu, Chen Zhanbo
School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China.
School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China.
Healthcare (Basel). 2020 Aug 24;8(3):291. doi: 10.3390/healthcare8030291.
Disease classification based on machine learning has become a crucial research topic in genetics and molecular biology. Disease classification is generally treated as a supervised learning problem; that is, it requires a large number of labelled samples to achieve good classification performance. In most cases, however, labelled samples are hard to obtain, so the amount of training data is limited. At the same time, many unclassified (unlabelled) sequences have been deposited in public databases, and these may help the training procedure. Learning from labelled and unlabelled samples together is called semi-supervised learning and is very useful in many applications. Self-training can be implemented by using samples in order from high to low confidence, which prevents noisy samples from affecting the robustness of semi-supervised learning during training. The deep forest method with the hyperparameter settings used in this paper can achieve excellent performance. Therefore, in this work, we propose a novel approach that combines a deep learning model with semi-supervised self-training to improve disease classification performance; it uses unlabelled samples to update a mechanism designed to increase the number of high-confidence pseudo-labelled samples. The experimental results show that our proposed model can achieve good performance in disease classification and disease-causing gene identification.
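The self-training idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the deep forest is replaced by a simple nearest-centroid classifier so that the example stays self-contained, and the function names (`fit_centroids`, `predict_proba`, `self_train`), the softmax confidence score, and the descending threshold schedule are all assumptions made for illustration. The sketch shows only the general pattern: fit on labelled data, pseudo-label the unlabelled samples whose predicted confidence clears a threshold, add them to the training set, then refit with a lower threshold.

```python
import math


def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_proba(centroids, x):
    """Confidence per class: softmax over negative Euclidean distances."""
    scores = {c: -math.dist(x, mu) for c, mu in centroids.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def self_train(X_lab, y_lab, X_unlab, thresholds=(0.95, 0.85, 0.75)):
    """Self-training with a hypothetical high-to-low confidence schedule."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for tau in thresholds:
        model = fit_centroids(X_lab, y_lab)
        remaining = []
        for x in pool:
            proba = predict_proba(model, x)
            label = max(proba, key=proba.get)
            if proba[label] >= tau:
                # Confident prediction: keep it as a pseudo-label.
                X_lab.append(x)
                y_lab.append(label)
            else:
                # Not confident enough yet: retry in a later round.
                remaining.append(x)
        pool = remaining
    # Final model is refit on labelled plus pseudo-labelled samples.
    return fit_centroids(X_lab, y_lab), X_lab, y_lab, pool


# Toy data (invented for illustration): two labelled points, three unlabelled.
labelled = [(0.0, 0.0), (10.0, 10.0)]
labels = [0, 1]
unlabelled = [(0.5, 0.5), (9.5, 9.5), (5.0, 5.0)]
model, X_all, y_all, leftover = self_train(labelled, labels, unlabelled)
print(len(X_all), "training samples;", len(leftover), "left unlabelled")
```

In this toy run the ambiguous point midway between the two classes never clears a threshold and stays unlabelled, which is the behaviour that keeps noisy pseudo-labels out of the training set.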