Centre Nutrition, Santé et Société (NUTRISS) - Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada.
Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada.
mSystems. 2023 Aug 31;8(4):e0053123. doi: 10.1128/msystems.00531-23. Epub 2023 Jul 5.
With concurrent advances in microbiome research and machine learning, the gut microbiome has attracted great interest as a potential source of biomarkers for classifying host health status. Shotgun metagenomics data derived from the human microbiome comprise a high-dimensional set of microbial features. Using such complex data to model host-microbiome interactions remains a challenge, because retaining the full content yields a highly granular set of microbial features. In this study, we compared the prediction performance of machine learning approaches across different types of data representations derived from shotgun metagenomics. These representations include the commonly used taxonomic and functional profiles and the more granular gene cluster approach. For the five case-control datasets used in this study (type 2 diabetes, obesity, liver cirrhosis, colorectal cancer, and inflammatory bowel disease), gene-based approaches, whether used alone or in combination with reference-based data types, yielded classification performance similar to or better than that of the taxonomic and functional profiles. In addition, we show that using subsets of gene families from specific functional categories highlights the importance of these functions for the host phenotype. This study demonstrates that both reference-free microbiome representations and curated metagenomic annotations can provide relevant representations for machine learning based on metagenomic data.
IMPORTANCE Data representation is an essential factor in machine learning performance when using metagenomic data. In this work, we show that different microbiome representations yield different host phenotype classification performance depending on the dataset. In classification tasks, untargeted microbiome gene content can provide classification performance similar to or better than that of taxonomic profiling. Feature selection based on biological function also improves classification performance for some pathologies. Function-based feature selection combined with interpretable machine learning algorithms can generate new hypotheses that could then be tested mechanistically. This work thus proposes new approaches to representing microbiome data for machine learning that can strengthen the findings drawn from metagenomic data.
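The idea of function-based feature selection described above can be illustrated with a small, self-contained sketch. It is not the study's pipeline: the gene-family names (`gf1` to `gf4`), the functional categories (`category_A`, `category_B`), the abundance values, and the choice of a leave-one-out nearest-centroid classifier are all hypothetical stand-ins chosen to keep the example short. The sketch compares classification accuracy on all gene families with accuracy on the subset annotated to one functional category.

```python
from statistics import mean

# Hypothetical toy data: per-sample gene-family abundances (not from the study).
samples = [
    {"gf1": 5.0, "gf2": 4.0, "gf3": 10.0, "gf4": 0.0},
    {"gf1": 6.0, "gf2": 5.0, "gf3": 0.0, "gf4": 10.0},
    {"gf1": 5.5, "gf2": 4.5, "gf3": 5.0, "gf4": 5.0},
    {"gf1": 1.0, "gf2": 1.0, "gf3": 10.0, "gf4": 0.0},
    {"gf1": 0.5, "gf2": 1.5, "gf3": 0.0, "gf4": 10.0},
    {"gf1": 1.5, "gf2": 0.5, "gf3": 5.0, "gf4": 5.0},
]
labels = ["case", "case", "case", "control", "control", "control"]

# Hypothetical functional annotation of each gene family.
annotations = {
    "gf1": "category_A",
    "gf2": "category_A",
    "gf3": "category_B",
    "gf4": "category_B",
}


def features_in_category(annotations, category):
    """Return the gene families annotated with the given functional category."""
    return sorted(f for f, c in annotations.items() if c == category)


def to_matrix(samples, features):
    """Build a samples x features matrix, filling absent features with 0."""
    return [[s.get(f, 0.0) for f in features] for s in samples]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (a stand-in model)."""
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        centroids = {
            lab: centroid([x for x, l in train if l == lab])
            for lab in {l for _, l in train}
        }
        pred = min(centroids, key=lambda lab: sq_dist(X[i], centroids[lab]))
        correct += pred == y[i]
    return correct / len(X)


all_features = sorted(annotations)
subset_features = features_in_category(annotations, "category_A")

acc_all = loo_accuracy(to_matrix(samples, all_features), labels)
acc_subset = loo_accuracy(to_matrix(samples, subset_features), labels)

print("all gene families:", acc_all)
print("category_A subset:", acc_subset)
```

In this toy setup the `category_B` features carry large variation unrelated to the phenotype, so restricting the model to `category_A` improves accuracy. With real metagenomic data the comparison would be run per dataset and per functional category, with a properly cross-validated classifier in place of the nearest-centroid stand-in.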