Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-742, Republic of Korea.
C&K genomics, Seoul National University Research Park, Seoul, 151-919, Republic of Korea.
Sci Rep. 2019 Jul 15;9(1):10189. doi: 10.1038/s41598-019-46249-x.
Disease prediction has been performed with machine learning approaches on various types of biological data. One representative data type is the gut microbial community, which interacts with the host's immune system. The abundance of a few microorganisms has been used as a marker to predict diverse diseases. In this study, we hypothesized that multi-class classification using a machine learning approach could distinguish the gut microbiomes associated with the following six diseases: multiple sclerosis, juvenile idiopathic arthritis, myalgic encephalomyelitis/chronic fatigue syndrome, acquired immune deficiency syndrome, stroke, and colorectal cancer. To establish the best prediction model, we used the abundance of microorganisms at five taxonomic levels as features in 696 samples collected from different studies. We built classification models based on four multi-class classifiers and two feature selection methods, forward selection and backward elimination. We found that classification performance improved as lower taxonomic levels were used for the features; the highest performance was observed at the genus level. Among the four classifiers, the LogitBoost-based prediction model outperformed the others. We also propose optimal genus-level feature subsets obtained by backward elimination. We believe the selected feature subsets could be used as markers to distinguish various diseases simultaneously. The findings of this study suggest the potential use of the selected features for the diagnosis of several diseases.
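The backward elimination step mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: a simple nearest-centroid classifier stands in for LogitBoost, the data are synthetic rather than microbial abundances, and every name here (`make_toy_data`, `cv_accuracy`, `backward_elimination`) is hypothetical. The sketch only shows the general idea of starting from all features and repeatedly dropping the one whose removal does not lower cross-validated accuracy.

```python
import random
from statistics import mean


def make_toy_data(n_per_class=30, seed=0):
    """Synthetic stand-in data: 2 informative features plus 4 noise features."""
    rng = random.Random(seed)
    centers = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 4.0)}
    X, y = [], []
    for _ in range(n_per_class):
        for label, (cx, cy) in centers.items():
            signal = [rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)]
            noise = [rng.gauss(0.0, 3.0) for _ in range(4)]
            X.append(signal + noise)
            y.append(label)
    return X, y


def centroid_fit(X, y, cols):
    """Per-class mean vector over the selected feature columns."""
    rows_by_label = {}
    for row, label in zip(X, y):
        rows_by_label.setdefault(label, []).append([row[c] for c in cols])
    return {lab: [mean(col) for col in zip(*rows)]
            for lab, rows in rows_by_label.items()}


def centroid_predict(model, row, cols):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    vec = [row[c] for c in cols]
    return min(model,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, model[lab])))


def cv_accuracy(X, y, cols, k=5):
    """k-fold cross-validated accuracy using a deterministic interleaved split."""
    n = len(X)
    correct = 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        model = centroid_fit([X[i] for i in train], [y[i] for i in train], cols)
        correct += sum(centroid_predict(model, X[i], cols) == y[i] for i in test)
    return correct / n


def backward_elimination(X, y, k=5):
    """Start from all features; drop one at a time while accuracy does not fall."""
    cols = list(range(len(X[0])))
    best = cv_accuracy(X, y, cols, k)
    while len(cols) > 1:
        # Score every subset obtained by removing exactly one remaining feature.
        scored = [(cv_accuracy(X, y, [c for c in cols if c != d], k), d)
                  for d in cols]
        score, drop = max(scored)
        if score < best:  # every possible removal hurts, so stop
            break
        best = score
        cols.remove(drop)
    return cols, best


if __name__ == "__main__":
    X, y = make_toy_data()
    selected, acc = backward_elimination(X, y)
    print("selected feature indices:", selected)
    print("cross-validated accuracy:", round(acc, 3))
```

Forward selection is the mirror image: it starts from an empty feature set and adds the single feature that most improves the score at each step. In practice, one would likely swap the toy classifier for a boosting implementation and the toy data for genus-level abundance tables.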