Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
Zurich Institute of Forensic Medicine, University of Zurich, Zurich, Switzerland.
Microbiome. 2018 Oct 24;6(1):192. doi: 10.1186/s40168-018-0565-6.
The identification of body site-specific microbial biomarkers and their use for classification tasks have promising applications in medicine, microbial ecology, and forensics. Previous studies have characterized site-specific microbiota and shown that sample origin can be accurately predicted by microbial content. However, these studies were usually restricted to single datasets with consistent experimental methods and conditions, as well as comparatively small sample numbers. The effects of study-specific biases and statistical power on classification performance and biomarker identification thus remain poorly understood. Furthermore, reliable detection in mixtures of different body sites or with noise from environmental contamination has rarely been investigated thus far. Finally, the impact of ecological associations between microbes on biomarker discovery was usually not considered in previous work.
Here we present the analysis of one of the largest cross-study sequencing datasets of microbial communities from human body sites (15,082 samples from 57 publicly available studies). We show that training a Random Forest Classifier on this aggregated dataset increases prediction performance for body sites by 35% compared to a single-study classifier. Using simulated datasets, we further demonstrate that the source of different microbial contributions in mixtures of different body sites or with soil can be detected starting at 1% of the total microbial community. We apply a biomarker selection method that excludes indirect environmental associations driven by microbe-microbe associations, yielding a parsimonious set of highly predictive taxa including novel biomarkers and excluding many previously reported taxa. We find a considerable fraction of unclassified biomarkers ("microbial dark matter") and observe that negatively associated taxa have a surprisingly high impact on classification performance. We further detect a significant enrichment of rod-shaped, motile, and sporulating taxa for feces biomarkers, consistent with a highly competitive environment.
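The mixture simulations mentioned above can be pictured as a weighted blend of two relative-abundance profiles. The following is a minimal sketch in Python, assuming each profile is a dictionary mapping taxon names to proportions that sum to 1. The helper name `mix_profiles` and the example taxa and values are illustrative and not taken from the study; the authors' actual simulation procedure may differ (for example, by subsampling reads rather than blending proportions).

```python
def mix_profiles(major, minor, minor_fraction):
    """Blend two relative-abundance profiles (hypothetical helper).

    major, minor: dicts mapping taxon name -> relative abundance,
                  each assumed to sum to 1.
    minor_fraction: share of the mixed community contributed by
                    `minor`, between 0 and 1.
    """
    if not 0.0 <= minor_fraction <= 1.0:
        raise ValueError("minor_fraction must be between 0 and 1")
    # Taxa missing from one profile contribute zero abundance from it.
    taxa = set(major) | set(minor)
    return {
        taxon: (1.0 - minor_fraction) * major.get(taxon, 0.0)
        + minor_fraction * minor.get(taxon, 0.0)
        for taxon in taxa
    }


# Made-up example profiles (not data from the study).
feces = {"Bacteroides": 0.6, "Faecalibacterium": 0.4}
soil = {"Acidobacteria": 0.7, "Bacteroides": 0.3}

# A feces sample in which soil contributes 1% of the community.
mixed = mix_profiles(feces, soil, 0.01)
```

Because the blend is a convex combination, the mixed profile still sums to 1. A classifier could then be evaluated on such profiles at decreasing values of `minor_fraction` to estimate the smallest contribution it can still detect.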
Our machine learning model shows strong body site classification performance, both in single-source samples and mixtures, making it promising for tasks requiring high accuracy, such as forensic applications. We report a core set of ecologically informed biomarkers, inferred across a wide range of experimental protocols and conditions, providing the most concise, general, and least biased overview of body site-associated microbes to date.