Department of Computer Science, University of Maryland-College Park, College Park, MD 20742, USA.
BMC Bioinformatics. 2010 Mar 24;11:152. doi: 10.1186/1471-2105-11-152.
Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, the analysis methods employed in these studies vary considerably.
Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters.
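The sensitivity to the clustering threshold described above can be illustrated with a minimal sketch. It is not the method evaluated in the study: it assumes a simple greedy centroid-based clustering, pre-aligned sequences of equal length, and identity measured as the fraction of matching positions. The toy sequences, the function names `identity` and `greedy_otus`, and the thresholds are all hypothetical and chosen only for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Greedy clustering: each sequence joins the first centroid it matches
    at >= threshold identity; otherwise it founds a new OTU."""
    centroids = []
    clusters = []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[i].append(s)
                break
        else:
            # No existing centroid is similar enough, so start a new OTU.
            centroids.append(s)
            clusters.append([s])
    return clusters


# Hypothetical toy data: three close variants of one sequence and one distant sequence.
SEQS = [
    "ACGTACGTACGTACGTACGT",  # reference variant
    "ACGTACGTACGTACGTACGA",  # 1 mismatch to the reference
    "ACGTACGTACGTACGTACAA",  # 2 mismatches to the reference
    "TTTTGGGGCCCCAAAATTTT",  # unrelated sequence
]

for t in (0.99, 0.94, 0.85):
    print(f"threshold={t:.2f} -> {len(greedy_otus(SEQS, t))} OTUs")
```

On this toy input the number of OTUs falls from four to two as the threshold is relaxed, which is the kind of parameter dependence the assessment is concerned with: a more stringent threshold splits near-identical sequences into separate units.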
Our results highlight the need for systematic and open evaluation of data analysis methodologies, especially as targeted 16S rRNA diversity studies increasingly rely on high-throughput sequencing technologies. All data and results from our study are available through the JGI FAMeS website (http://fames.jgi-psf.org/).