Li Peikun, Li Min, Chen Wei-Hua
Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular Imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China.
School of Biological Science, Jining Medical University, Rizhao, China.
Gut Microbes. 2025 Dec;17(1):2489074. doi: 10.1080/19490976.2025.2489074. Epub 2025 Apr 4.
The human gut microbiome, crucial in various diseases, can be utilized to develop diagnostic models through machine learning (ML). The specific tools and parameters used in model construction, such as data preprocessing, batch effect removal, and modeling algorithms, can affect model performance and generalizability. To establish a generally applicable workflow, we divided the ML process into these three steps and optimized each sequentially using 83 gut microbiome cohorts across 20 diseases. We tested a total of 156 tool-parameter-algorithm combinations and benchmarked them according to internal and external AUCs. At the data preprocessing step, we identified four data preprocessing methods that performed well for regression-type algorithms and one method that excelled for non-regression-type algorithms. At the batch effect removal step, we identified the R function "ComBat" as an effective batch effect removal method and compared the performance of various algorithms. Finally, at the ML algorithm selection step, we found that Ridge and Random Forest ranked the best. Our optimized workflow performed comparably to previous exhaustive, disease-specific optimization methods; it is therefore generally applicable and can provide a comprehensive guideline for constructing diagnostic models for a range of diseases, potentially serving as a powerful tool for future medical diagnostics.
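The benchmark above ranks combinations by AUC. As a minimal sketch of that metric, the following computes AUC as the rank-based (Mann-Whitney) probability that a randomly chosen case scores higher than a randomly chosen control. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the study; the abstract does not specify how the authors computed AUC or defined their internal and external splits.

```python
from itertools import groupby


def roc_auc(labels, scores):
    """Return the AUC for binary labels (1 = case, 0 = control).

    Computed as the Mann-Whitney U statistic divided by n_pos * n_neg.
    Tied scores receive their average rank.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")

    # Sort by score, then walk through groups of equal scores and
    # give every member of a group the group's average 1-based rank.
    ordered = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    next_rank = 1
    for _, group in groupby(ordered, key=lambda pair: pair[0]):
        members = list(group)
        avg_rank = next_rank + (len(members) - 1) / 2
        rank_sum_pos += avg_rank * sum(1 for _, y in members if y == 1)
        next_rank += len(members)

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Toy, made-up predictions (hypothetical, for illustration only).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

In a setup like the one described, the same function would be applied once to held-out samples from the training cohort (internal AUC) and once to samples from an independent cohort (external AUC); how the two are weighted when ranking combinations is a design choice the abstract leaves open.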