Biosystems and Biomaterials Division, National Institute of Standards and Technology, 100 Bureau Dr., Gaithersburg, 20899, MD, USA.
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, 8314 Paint Branch Dr., College Park, 20742, MD, USA.
Microbiome. 2020 Mar 13;8(1):35. doi: 10.1186/s40168-020-00812-1.
A variety of bioinformatic pipelines and downstream analysis methods are available for analyzing 16S rRNA marker-gene surveys. However, there is limited guidance for choosing between them, so appropriate assessment datasets and metrics are needed. Mixtures of environmental samples are useful for assessing analysis methods because expected values can be calculated from the unmixed sample measurements and the mixture design, and methods can then be evaluated against those expected values. Previous studies have used mixtures of environmental samples to assess other sequencing methods such as RNAseq. However, no studies have used mixtures of environmental samples to assess 16S rRNA sequencing.
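As a rough illustration of how expected values could follow from unmixed measurements and a mixture design, the sketch below assumes a simple linear two-sample mixture. The function name, the `theta` mixing proportion, and the feature names are hypothetical and are not taken from the study.

```python
def expected_relative_abundance(unmixed_a, unmixed_b, theta):
    """Expected relative abundance of each feature in a two-sample mixture.

    Assumption (not from the study): the mixture is linear, with `theta`
    the proportion contributed by sample A and (1 - theta) by sample B.
    Inputs map feature name -> relative abundance in the unmixed sample.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be between 0 and 1")
    features = set(unmixed_a) | set(unmixed_b)
    return {
        f: theta * unmixed_a.get(f, 0.0) + (1.0 - theta) * unmixed_b.get(f, 0.0)
        for f in features
    }


# Hypothetical unmixed measurements for two samples.
sample_a = {"feat1": 0.5, "feat2": 0.5}
sample_b = {"feat2": 0.25, "feat3": 0.75}

expected = expected_relative_abundance(sample_a, sample_b, theta=0.25)
```

Observed relative abundances from a pipeline's count table could then be compared against `expected` feature by feature.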
We developed a framework for assessing 16S rRNA sequencing analysis methods that uses a novel two-sample titration mixture dataset and metrics to evaluate qualitative and quantitative characteristics of count tables. Our qualitative assessment evaluates feature presence/absence: for features present only in unmixed samples or only in titrations, it tests whether random sampling can account for their observed relative abundance. Our quantitative assessment evaluates feature relative and differential abundance by comparing observed and expected values. We demonstrated the framework by evaluating count tables generated with three commonly used bioinformatic pipelines: (i) DADA2, a sequence inference method; (ii) Mothur, a de novo clustering method; and (iii) QIIME, an open-reference clustering method. The qualitative assessment results indicated that the majority of Mothur and QIIME features present only in unmixed samples or only in titrations were accounted for by random sampling alone, but this was not the case for DADA2 features. Combined with count table sparsity (the proportion of zero-valued cells in a count table), these results indicate that DADA2 has a higher false-negative rate, whereas Mothur and QIIME have higher false-positive rates. The quantitative assessment results indicated that the observed relative abundance and differential abundance values were consistent with expected values for all three pipelines.
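The two quantities mentioned above, count table sparsity and whether random sampling can account for a missing feature, can be sketched as follows. This is a simplified illustration, not the statistical test used in the study: it assumes reads are drawn independently, so the chance of seeing zero reads of a feature is `(1 - p) ** n`. All function names and the `alpha` threshold are hypothetical.

```python
def sparsity(count_table):
    """Proportion of zero-valued cells in a count table (list of rows)."""
    cells = [count for row in count_table for count in row]
    return sum(1 for count in cells if count == 0) / len(cells)


def prob_zero_by_sampling(expected_proportion, total_reads):
    """Probability of observing zero reads of a feature by chance alone.

    Assumption: each read is an independent draw in which the feature
    appears with probability `expected_proportion`.
    """
    return (1.0 - expected_proportion) ** total_reads


def explained_by_sampling(expected_proportion, total_reads, alpha=0.05):
    """True if an absence is plausibly due to random sampling alone."""
    return prob_zero_by_sampling(expected_proportion, total_reads) >= alpha


# Hypothetical count table: rows are features, columns are samples.
table = [[0, 5], [3, 0]]
table_sparsity = sparsity(table)

# A rare feature can plausibly be missed at this sequencing depth ...
rare_ok = explained_by_sampling(0.001, total_reads=1000)
# ... whereas missing a more abundant feature is unlikely to be chance.
abundant_ok = explained_by_sampling(0.01, total_reads=1000)
```

Under this sketch, an absence that sampling cannot explain would point toward a false negative, and a sparse table dominated by absences that sampling can explain would be consistent with many low-abundance, possibly spurious, features.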
We developed a novel framework for assessing 16S rRNA marker-gene survey methods and demonstrated the framework by evaluating count tables generated with three bioinformatic pipelines. This framework is a valuable community resource for assessing 16S rRNA marker-gene survey bioinformatic methods and will help scientists identify appropriate analysis methods for their marker-gene surveys.