NEC Laboratories Europe, Heidelberg, Germany.
Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE.
PLoS Comput Biol. 2022 Apr 11;18(4):e1010050. doi: 10.1371/journal.pcbi.1010050. eCollection 2022 Apr.
Scientific research is shedding light on the interaction of the gut microbiome with the human host and on its role in human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. Most of them leverage shotgun metagenomic sequencing to extract gut microbial species-relative abundances or strain-level markers. Each of these gut microbial profiling modalities has shown diagnostic potential when tested separately; however, no existing approach combines them in a single predictive framework. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model capable of learning a joint representation of multiple heterogeneous data modalities. MVIB achieves competitive classification performance while being faster than existing methods. Additionally, MVIB offers interpretable results. Our model adopts an information theoretic interpretation of deep neural networks and computes a joint stochastic encoding of different input data modalities. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundances and strain-level markers. MVIB is evaluated on human gut metagenomic samples from 11 publicly available disease cohorts covering 6 different diseases. We achieve high performance (0.80 < ROC AUC < 0.95) on 5 cohorts and at least medium performance on the remaining ones. We adopt a saliency technique to interpret the output of MVIB and identify the microbial species and strain-level markers most relevant to the model's predictions. We also perform cross-study generalisation experiments, where we train and test MVIB on different cohorts of the same disease; overall, we achieve results comparable to those of the baseline approach, i.e. the Random Forest. Further, we evaluate our model by adding metabolomic data derived from mass spectrometry as a third input modality. Our method is scalable with respect to input data modalities and has an average training time of < 1.4 seconds. The source code and the datasets used in this work are publicly available.
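To make the idea of a "joint stochastic encoding" of several input modalities more concrete, the sketch below shows one common way such an encoding can be formed. It is an illustration under assumptions, not the published implementation: the abstract does not say how the per-modality encodings are combined, so a product of diagonal Gaussian experts with a standard normal prior expert is assumed here, and the function names (`product_of_gaussian_experts`, `sample_latent`, `kl_to_standard_normal`) and the toy input values are hypothetical.

```python
import numpy as np


def product_of_gaussian_experts(mus, logvars):
    """Fuse per-modality diagonal Gaussians into one joint Gaussian.

    Assumption: fusion is a product of experts that also includes a
    standard normal prior expert; the abstract does not specify this.
    mus, logvars: array-likes of shape (n_modalities, latent_dim).
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)
    latent_dim = mus.shape[1]
    # Prepend the prior expert N(0, I): mean 0 and log-variance 0.
    mus = np.vstack([np.zeros((1, latent_dim)), mus])
    logvars = np.vstack([np.zeros((1, latent_dim)), logvars])
    precisions = np.exp(-logvars)                # 1 / variance per expert
    joint_var = 1.0 / precisions.sum(axis=0)     # precisions add up
    joint_mu = joint_var * (precisions * mus).sum(axis=0)
    return joint_mu, joint_var


def sample_latent(mu, var, rng):
    """Draw z = mu + sigma * eps (the reparameterisation trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(var) * eps


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), a typical bottleneck penalty."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))


if __name__ == "__main__":
    # Hypothetical encoder outputs for two modalities, e.g. species
    # abundances and strain-level markers, with a 2-dimensional latent.
    mus = [[0.5, -1.0], [1.5, 0.0]]
    logvars = [[0.0, 0.0], [0.0, 0.0]]
    joint_mu, joint_var = product_of_gaussian_experts(mus, logvars)
    z = sample_latent(joint_mu, joint_var, np.random.default_rng(0))
    print(joint_mu, joint_var, z, kl_to_standard_normal(joint_mu, joint_var))
```

Under this assumption, adding a third modality such as metabolomic data only adds one more row to `mus` and `logvars`, which is one way a model of this kind can stay scalable with respect to input modalities.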