Caruso Vincent, Song Xubo, Asquith Mark, Karstens Lisa
Division of Bioinformatics and Computational Biology, Oregon Health and Science University, Portland, Oregon, USA.
Center for Spoken Language Understanding, Oregon Health and Science University, Portland, Oregon, USA.
mSystems. 2019 Feb 19;4(1). doi: 10.1128/mSystems.00163-18. eCollection 2019 Jan-Feb.
Microbiome community composition plays an important role in human health, and while most research to date has focused on high-microbial-biomass communities, low-biomass communities are also important. However, contamination and technical noise make determining the true community signal difficult when biomass levels are low, and the influence of varying biomass on sequence processing methods has received little attention. Here, we benchmarked six methods that infer community composition from 16S rRNA sequence reads, using samples of varying biomass. We included two operational taxonomic unit (OTU) clustering algorithms, one entropy-based method, and three more-recent amplicon sequence variant (ASV) methods. We first compared inference results from high-biomass mock communities to assess baseline performance. We then benchmarked the methods on a dilution series made from a single mock community, that is, samples that varied only in biomass. ASVs/OTUs inferred by each method were classified as representing expected community, technical noise, or contamination. With the high-biomass data, we found that the ASV methods had good sensitivity and precision, whereas the other methods suffered in one area or in both. Inferred contamination was present only in small proportions. With the dilution series, contamination represented an increasing proportion of the data from the inferred communities, regardless of the inference method used. However, correlation between inferred contaminants and sample biomass was strongest for the ASV methods and weakest for the OTU methods. Thus, no inference method on its own can distinguish true community sequences from contaminant sequences, but ASV methods provide the most accurate characterization of community and contaminants.
Microbial communities have important ramifications for human health, but determining their impact requires accurate characterization. Current technology makes microbiome sequence data more accessible than ever. However, popular software methods for analyzing these data are based on algorithms developed alongside older sequencing technology and smaller data sets and thus may not be adequate for modern, high-throughput data sets. Additionally, samples from environments where microbes are scarce present additional challenges to community characterization relative to high-biomass environments, an issue that is often ignored. We found that a new class of microbiome sequence processing tools, called amplicon sequence variant (ASV) methods, outperformed conventional methods. In samples representing low-biomass communities, where sample contamination becomes a significant confounding factor, the improved accuracy of ASV methods may allow more-robust computational identification of contaminants.