Ruiz Baptiste, Belcour Arnaud, Blanquart Samuel, Buffet-Bataillon Sylvie, Le Huërou-Luron Isabelle, Siegel Anne, Le Cunff Yann
University Rennes, Inria, CNRS, IRISA, Rennes, France.
University Grenoble Alpes, Inria, Grenoble, France.
PLoS Comput Biol. 2024 Nov 18;20(11):e1012577. doi: 10.1371/journal.pcbi.1012577. eCollection 2024 Nov.
The composition of the gut microbiota is a known factor in various diseases and has proven to be a strong basis for the automatic classification of disease state. This has prompted calls for a better understanding of microbiota data at the functional scale, which would make these approaches more biologically interpretable. In this paper, we present SPARTA, a computational pipeline that integrates the functional annotation of the gut microbiota into an automatic classification process and facilitates downstream interpretation of its results. The pipeline takes taxonomic composition data as input, which can be built from 16S or whole-genome sequencing, and links each taxonomic component to its functional annotations by querying the UniProt database. A functional profile of the gut microbiota is built on this basis. Both profiles, microbial and functional, are used to train Random Forest classifiers to distinguish unhealthy from control samples. SPARTA ensures full reproducibility and exploration of inherent variability by extending state-of-the-art methods in three dimensions: an increased number of trained random forests, iterative selection of important variables, and repetition of the full selection process from different seeds. On 5 of 6 datasets, classification performance on functional profiles does not differ significantly from performance on microbial profiles. The main contribution of this approach, however, stems from its interpretability rather than its performance: through repetition, it also outputs a robust subset of discriminant variables. These selections were shown to be more consistent than those obtained by a state-of-the-art method, and their contents were validated through a manual literature review. The interconnections between selected taxa and functional annotations were also analyzed, revealing that important annotations emerge from the cumulated influence of non-selected taxa.
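The two ideas at the core of the abstract, translating a taxonomic profile into a functional one and keeping only variables that are selected consistently across repetitions, can be illustrated with a minimal sketch. This is not the SPARTA implementation. The function names, the abundance-summing rule, the `min_fraction` threshold, and the toy taxa and annotation identifiers are all assumptions made for illustration; the abstract does not specify how annotation scores are computed or how repeated selections are combined.

```python
from collections import Counter, defaultdict


def build_functional_profile(taxon_abundance, taxon_annotations):
    """Assumed rule: an annotation's score is the summed abundance
    of every taxon that carries it."""
    profile = defaultdict(float)
    for taxon, abundance in taxon_abundance.items():
        # Taxa without any known annotation contribute nothing.
        for annotation in taxon_annotations.get(taxon, ()):
            profile[annotation] += abundance
    return dict(profile)


def robust_selection(selections, min_fraction=1.0):
    """Keep variables selected in at least `min_fraction` of the runs
    (hypothetical way of combining selections from different seeds)."""
    counts = Counter(v for run in selections for v in set(run))
    threshold = min_fraction * len(selections)
    return {v for v, c in counts.items() if c >= threshold}


# Hypothetical toy sample: relative abundances of three taxa.
sample = {"Bacteroides": 0.5, "Faecalibacterium": 0.25, "Escherichia": 0.25}

# Hypothetical taxon-to-annotation mapping (in the pipeline described
# above, this information would come from querying UniProt).
annotations = {
    "Bacteroides": {"GO:0008152", "EC:3.2.1.23"},
    "Faecalibacterium": {"GO:0008152"},
    "Escherichia": {"EC:3.2.1.23", "GO:0006810"},
}

profile = build_functional_profile(sample, annotations)

# Hypothetical selections from three runs started from different seeds.
runs = [["GO:0008152", "EC:3.2.1.23"],
        ["GO:0008152", "GO:0006810"],
        ["GO:0008152", "EC:3.2.1.23"]]
robust = robust_selection(runs)
```

Under this summing rule, an annotation shared by several moderately abundant taxa can score higher than any single one of them, which gives some intuition for how an annotation could become discriminant even when none of the taxa carrying it is selected on its own.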