Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China.
Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States.
Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btad757.
Research on the human microbiome has suggested associations with human health, opening opportunities to predict health outcomes from microbiome data. Studies have also suggested that diverse forms of taxa, such as rare taxa that are evolutionarily related and abundant taxa that are evolutionarily unrelated, could be associated with or predictive of a health outcome. Although prediction models have been developed for microbiome data, none currently uses multiple forms of microbiome-outcome association.
We developed MK-BMC, a Multi-Kernel framework with Boosted distance Metrics for Classification using microbiome data. We first boost widely used distance metrics for microbiome data, using taxon-level association signal strengths to up-weight taxa that are potentially associated with an outcome of interest. We then propose a multi-kernel prediction model in which each kernel captures one form of association between taxa and the outcome. Each kernel is obtained by transforming a proposed boosted distance metric and measures the similarity of microbiome compositions between pairs of samples. Through extensive simulations, we demonstrated superior prediction performance of (i) boosted distance metrics for microbiome data over the original ones and (ii) MK-BMC over competing methods. We applied MK-BMC to predict thyroid, obesity, and inflammatory bowel disease status using gut microbiome data from the American Gut Project and observed much improved prediction performance over that of competing methods. The learned kernel weights also help clarify the contribution of each form of microbiome signal.
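The pipeline described above (boost a distance metric, turn it into a kernel, combine several kernels) can be illustrated with a minimal sketch. This is not the MK-BMC implementation. The abstract does not give the boosting formula, the distance-to-kernel transform, or the way kernel weights are learned, so everything below is an assumption made for illustration: a Bray-Curtis distance with hypothetical per-taxon weights `w`, a Gower-centered kernel with negative eigenvalues clipped to zero, and a fixed convex combination of kernels. The function names are hypothetical as well.

```python
import numpy as np

def weighted_bray_curtis(X, w):
    """Pairwise Bray-Curtis distance after scaling each taxon by a weight.

    X: (n_samples, n_taxa) non-negative abundances.
    w: (n_taxa,) non-negative weights. Assumption: larger weights stand in
       for taxa with stronger association signal; the paper's actual
       boosting scheme may differ.
    """
    Xw = np.asarray(X, dtype=float) * np.asarray(w, dtype=float)
    num = np.abs(Xw[:, None, :] - Xw[None, :, :]).sum(axis=2)
    den = (Xw[:, None, :] + Xw[None, :, :]).sum(axis=2)
    # Define the distance as 0 where both samples have zero total mass.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def distance_to_kernel(D):
    """Assumed transform: Gower centering, K = -1/2 * J D^2 J,
    followed by clipping negative eigenvalues so K is positive semidefinite."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh((K + K.T) / 2)
    vals = np.clip(vals, 0.0, None)
    return (vecs * vals) @ vecs.T

def combine_kernels(kernels, weights):
    """Convex combination of kernels; weights are normalized to sum to 1.
    Here the weights are supplied by hand, whereas MK-BMC learns them."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(wt * K for wt, K in zip(weights, kernels))
```

The combined matrix could then be passed to any classifier that accepts a precomputed kernel. Inspecting the normalized weights gives a rough sense of how much each kernel contributes, which mirrors the interpretive use of kernel weights mentioned in the abstract.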
Source code together with a sample input dataset is available at https://github.com/HXu06/MK-BMC.