Monshizadeh Mahsa, Hong Yuhui, Ye Yuzhen
Computer Science Department, Indiana University, Bloomington, IN 47408, United States.
Bioinform Adv. 2024 Dec 13;5(1):vbae203. doi: 10.1093/bioadv/vbae203. eCollection 2025.
Microbial signatures in the human microbiome are closely associated with various human diseases, driving the development of machine learning models for microbiome-based disease prediction. Despite progress, challenges remain in improving prediction accuracy, generalizability, and interpretability. Confounding factors, such as the host's gender, age, and body mass index, significantly influence the human microbiome and complicate microbiome-based predictions.
To address these challenges, we developed MicroKPNN-MT, a unified model for predicting human phenotypes from microbiome data together with additional metadata such as age and gender. The model builds on our earlier MicroKPNN framework, which incorporates prior knowledge of microbial species into neural networks to enhance prediction accuracy and interpretability. In MicroKPNN-MT, metadata, when available, serve as additional input features for prediction; when they are unavailable, the model predicts them from the microbiome data using additional decoders. We applied MicroKPNN-MT to microbiome data collected in mBodyMap, covering healthy individuals and 25 different diseases, and demonstrated its potential as a predictive tool for multiple diseases that simultaneously predicts missing metadata. Our results showed that incorporating real or predicted metadata improved the accuracy of disease predictions and, more importantly, improved the generalizability of the predictive models.
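The metadata handling described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain dense encoder with random, untrained weights instead of the knowledge-primed hidden layers of MicroKPNN, and the class name `MultiTaskSketch`, the layer sizes, the four age groups, and the two gender categories are all hypothetical choices made for illustration. Only the 26 output classes (healthy plus 25 diseases) come from the text. The sketch shows one plausible way to feed observed metadata into the disease head and to fall back on decoder predictions when metadata are missing.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class MultiTaskSketch:
    """Hypothetical multi-task model; all sizes below are assumptions."""

    def __init__(self, n_species, n_hidden=8, n_classes=26,
                 n_age_groups=4, n_genders=2, seed=0):
        rng = random.Random(seed)

        def layer(n_out, n_in):
            # Random, untrained weights; a real model would learn these.
            w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                 for _ in range(n_out)]
            return w, [0.0] * n_out

        self.n_age_groups = n_age_groups
        self.n_genders = n_genders
        self.encoder = layer(n_hidden, n_species)
        self.age_decoder = layer(n_age_groups, n_hidden)
        self.gender_decoder = layer(n_genders, n_hidden)
        # The disease head sees the hidden state plus both metadata vectors.
        self.disease_head = layer(n_classes,
                                  n_hidden + n_age_groups + n_genders)

    def forward(self, abundances, age_group=None, gender=None):
        hidden = relu(linear(abundances, *self.encoder))
        # The decoders always produce metadata predictions.
        age_pred = softmax(linear(hidden, *self.age_decoder))
        gender_pred = softmax(linear(hidden, *self.gender_decoder))
        # Use observed metadata when given, otherwise the predictions.
        age_in = (one_hot(age_group, self.n_age_groups)
                  if age_group is not None else age_pred)
        gender_in = (one_hot(gender, self.n_genders)
                     if gender is not None else gender_pred)
        disease = softmax(
            linear(hidden + age_in + gender_in, *self.disease_head))
        return {"disease": disease, "age": age_pred, "gender": gender_pred,
                "age_input": age_in, "gender_input": gender_in}


model = MultiTaskSketch(n_species=5)
sample = [0.4, 0.3, 0.2, 0.1, 0.0]  # made-up relative abundances
with_meta = model.forward(sample, age_group=2, gender=1)
without_meta = model.forward(sample)
```

In this sketch the disease head receives a one-hot vector when metadata are supplied and the decoder's probability vector otherwise, so the same network can be applied to samples with or without metadata.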