Department of Computer Science, University of California Los Angeles, Los Angeles, California, USA.
Center for Studies in Physics and Biology, Rockefeller University, New York, New York, USA.
mSystems. 2022 Oct 26;7(5):e0099521. doi: 10.1128/msystems.00995-21. Epub 2022 Sep 1.
Microbial source tracking analysis has emerged as a widespread technique for characterizing the properties of complex microbial communities. However, this analysis is currently limited to source environments sampled in a specific study. To expand the scope beyond a single study and allow the exploration of source environments using large databases and repositories, such as the Earth Microbiome Project, a source selection procedure is required. Such a procedure makes it possible to differentiate between contributing environments and nuisance ones when the number of potential sources considered is high. Here, we introduce STENSL (microbial Source Tracking with ENvironment SeLection), a machine learning method that extends common microbial source tracking analysis by performing an unsupervised source selection and enabling sparse identification of latent source environments. By incorporating sparsity into the estimation of potential source environments, STENSL improves the accuracy of the estimated contributions of true sources while significantly reducing the noise introduced by noncontributing ones. We therefore anticipate that source selection will augment microbial source tracking analyses, enabling exploration of multiple source environments from publicly available repositories while maintaining high accuracy of the statistical inference.
Microbial source tracking is a powerful tool to characterize the properties of complex microbial communities. However, this analysis is currently limited to source environments sampled in a specific study. In many applications there is a clear need to consider source selection over a large array of microbial environments, external to the study. To this end, we developed STENSL (microbial Source Tracking with ENvironment SeLection), an expectation-maximization algorithm with sparsity that enables the identification of contributing sources among a large set of potential microbial environments. With the unprecedented expansion of microbiome data repositories such as the Earth Microbiome Project, which records over 200,000 samples from more than 50 types of categorized environments, STENSL takes the first steps in performing automated source exploration and selection. STENSL is significantly more accurate in identifying the contributing sources as well as the unknown source, even when considering hundreds of potential source environments, settings in which state-of-the-art microbial source tracking methods add considerable error.
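To make the idea of an expectation-maximization algorithm with sparsity more concrete, the following is a minimal illustrative sketch and not the STENSL implementation. It assumes that the source profiles are known and fixed, that there is no unknown source, and that sparsity is imposed by a simple soft-threshold in the M-step. The function name `sparse_em_mixing`, the `penalty` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def sparse_em_mixing(sink_counts, sources, penalty=1.0, n_iter=200, eps=1e-12):
    """Estimate sparse mixing proportions of known sources for one sink sample.

    sink_counts : (T,) taxa counts observed in the sink.
    sources     : (K, T) relative-abundance profiles, each row summing to 1.
    penalty     : hypothetical sparsity strength; a source whose expected
                  count falls below this value is dropped (set to zero).
    """
    sink_counts = np.asarray(sink_counts, dtype=float)
    sources = np.asarray(sources, dtype=float)
    n_sources = sources.shape[0]
    alpha = np.full(n_sources, 1.0 / n_sources)  # uniform starting proportions

    for _ in range(n_iter):
        # E-step: share of each taxon's counts attributed to each source.
        weighted = alpha[:, None] * sources                      # (K, T)
        resp = weighted / (weighted.sum(axis=0, keepdims=True) + eps)
        expected = resp @ sink_counts                            # (K,)

        # M-step with sparsity: shrink expected counts, clip at zero, renormalize.
        shrunk = np.maximum(expected - penalty, 0.0)
        if shrunk.sum() == 0.0:
            break  # penalty too strong for this sample; keep the last estimate
        alpha = shrunk / shrunk.sum()

    return alpha


# Toy data (assumed): the sink is a 70/30 mix of sources 0 and 1;
# sources 2 and 3 are nuisance environments that do not contribute.
sources = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.5, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.4, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.5],
])
sink = 1000 * (0.7 * sources[0] + 0.3 * sources[1])
alpha = sparse_em_mixing(sink, sources)
print(np.round(alpha, 3))
```

In this toy setting the thresholded M-step sets the proportions of the two nuisance sources to exactly zero, whereas an unpenalized update would only shrink them toward zero. The penalty also introduces a small bias in the remaining proportions, a trade-off that any real method has to manage.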