Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Republic of Korea.
POSTECH Biotech Center, Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea.
Gut Microbes. 2024 Jan-Dec;16(1):2375679. doi: 10.1080/19490976.2024.2375679. Epub 2024 Jul 7.
The gut microbiome, linked significantly to host diseases, offers potential for disease diagnosis through machine learning (ML) pipelines. These pipelines, crucial in modeling diseases using high-dimensional microbiome data, involve selecting profile modalities, data preprocessing techniques, and classification algorithms, each impacting the model accuracy and generalizability. Despite whole metagenome shotgun sequencing (WMS) gaining popularity for human gut microbiome profiling, a consensus on the optimal methods for ML pipelines in disease diagnosis using WMS data remains elusive. Addressing this gap, we comprehensively evaluated ML methods for diagnosing Crohn's disease and colorectal cancer, using 2,553 fecal WMS samples from 21 case-control studies. Our study uncovered crucial insights: gut-specific, species-level taxonomic features proved to be the most effective for profiling; batch correction was not consistently beneficial for model performance; compositional data transformations markedly improved the models; and while nonlinear ensemble classification algorithms typically offered superior performance, linear models with proper regularization were found to be more effective for diseases that are linearly separable based on microbiome data. An optimal ML pipeline, integrating the most effective methods, was validated for generalizability using holdout data. This research offers practical guidelines for constructing reliable disease diagnostic ML models with fecal WMS data.
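The abstract reports that compositional data transformations markedly improved the models but does not name a specific transformation. As an illustration only, the sketch below applies the centered log-ratio (CLR) transform, one commonly used compositional transformation, to a single sample's relative abundances. The function name `clr_transform`, the pseudocount value, and the toy abundance vector are assumptions for this example and are not taken from the study.

```python
import math


def clr_transform(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's relative abundances.

    The pseudocount is a hypothetical choice for handling zeros; the
    source abstract does not say how zeros were treated.
    """
    # Shift every value so that log() is defined for zero abundances.
    shifted = [a + pseudocount for a in abundances]
    logs = [math.log(v) for v in shifted]
    # Subtracting the mean log is equivalent to dividing each value by the
    # geometric mean of the sample before taking the log.
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Toy sample: relative abundances of four hypothetical species.
sample = [0.5, 0.3, 0.2, 0.0]
clr = clr_transform(sample)
```

The transformed values sum to zero within floating-point error and preserve the rank order of the input, which puts the features on an unconstrained scale that classifiers such as regularized linear models can work with.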