Adachi Asahi, Zhang Fan, Kanaya Shigehiko, Ono Naoaki
Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma 630-0192, Japan.
Data Science Center, Nara Institute of Science and Technology, Ikoma 630-0192, Japan.
Bioinform Adv. 2025 Mar 11;5(1):vbaf045. doi: 10.1093/bioadv/vbaf045. eCollection 2025.
The human microbiome is closely associated with the health and disease of the human host. Machine learning models have recently utilized the human microbiome to predict health conditions and disease status. Quantifying predictive uncertainty is essential for the reliable application of these microbiome-based prediction models in clinical settings. However, uncertainty quantification in such prediction models remains unexplored. In this study, we have developed a probabilistic prediction model using a Gaussian process (GP) with a kernel function that incorporates microbial community dissimilarities. We evaluated the performance of probabilistic prediction across three regression tasks: chronological age, body mass index, and disease severity, using publicly available human gut microbiome datasets. The results demonstrated that our model outperformed existing methods in terms of probabilistic prediction accuracy. Furthermore, we found that the confidence levels closely matched the empirical coverage and that data points predicted with lower uncertainty corresponded to lower prediction errors. These findings suggest that GP regression models incorporating community dissimilarities effectively capture the characteristics of phylogenetic, high-dimensional, and sparse microbial abundance data. Our study provides a more reliable framework for microbiome-based prediction, potentially advancing the application of microbiome data in health monitoring and disease diagnosis in clinical settings.
The code is available at https://github.com/asahiadachi/gp4microbiome.
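As an illustration of the approach described in the abstract, the following is a minimal sketch of GP regression with a kernel built from a community dissimilarity, plus a check of empirical coverage. It is not the authors' implementation, for which the repository above is the authoritative source. The choice of the Aitchison distance (Euclidean distance between centered log-ratio profiles), the squared-exponential form, the fixed hyperparameters, and all function names (`clr`, `dissimilarity_kernel`, `gp_predict`, `empirical_coverage`) are assumptions made for this example; the abstract does not state which dissimilarities or kernel form were used.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import norm


def clr(X, pseudocount=1e-6):
    """Centered log-ratio transform of relative abundances (rows = samples)."""
    log_x = np.log(X + pseudocount)  # pseudocount guards against zero counts
    return log_x - log_x.mean(axis=1, keepdims=True)


def dissimilarity_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel on the Aitchison distance (an assumed choice)."""
    D = cdist(clr(A), clr(B), metric="euclidean")
    return variance * np.exp(-0.5 * (D / length_scale) ** 2)


def gp_predict(X_train, y_train, X_test, length_scale=1.0, variance=1.0, noise=0.1):
    """Posterior predictive mean and standard deviation of a GP regressor."""
    y_mean = y_train.mean()  # constant mean function, estimated from the data
    K = dissimilarity_kernel(X_train, X_train, length_scale, variance)
    K += noise * np.eye(len(X_train))  # Gaussian observation noise
    K_s = dissimilarity_kernel(X_test, X_train, length_scale, variance)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = K_s @ alpha + y_mean
    v = np.linalg.solve(L, K_s.T)
    # k(x, x) equals `variance` because the distance of a sample to itself is 0.
    var = variance - np.sum(v ** 2, axis=0) + noise
    return mean, np.sqrt(np.maximum(var, 1e-12))


def empirical_coverage(y_true, mean, std, level=0.95):
    """Fraction of targets falling inside the central `level` predictive interval."""
    z = norm.ppf(0.5 + level / 2)
    return float(np.mean(np.abs(y_true - mean) <= z * std))


if __name__ == "__main__":
    # Synthetic compositional data, used only to exercise the functions.
    rng = np.random.default_rng(0)
    X = rng.dirichlet(np.full(20, 0.5), size=80)
    y = clr(X)[:, 0] + rng.normal(scale=0.3, size=80)
    mean, std = gp_predict(X[:60], y[:60], X[60:], length_scale=10.0)
    print("95% interval coverage:", empirical_coverage(y[60:], mean, std))
```

A well-calibrated model yields coverage close to the nominal level across several values of `level`. In practice the hyperparameters would be fitted, for example by maximizing the marginal likelihood, rather than fixed as they are here.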