Department of Biology, University of Southern California, University Park, Los Angeles, CA 90089, USA.
Bioinformatics. 2011 Mar 1;27(5):611-8. doi: 10.1093/bioinformatics/btq725. Epub 2011 Jan 13.
With advances in next-generation sequencing technology, it is now possible to study samples obtained directly from the environment. In particular, 16S rRNA gene sequences are frequently used to profile the diversity of organisms in a sample. However, such studies still struggle to determine both the number of operational taxonomic units (OTUs) in a sample and their relative abundance.
To address these challenges, we propose an unsupervised Bayesian clustering method termed Clustering 16S rRNA for OTU Prediction (CROP). CROP finds clusters based on the natural organization of the data, without the hard cut-off threshold (such as 3% or 5%) that hierarchical clustering methods require. By applying our method to several datasets, we demonstrate that CROP is robust against sequencing errors and that it produces more accurate results than conventional hierarchical clustering methods.
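To make the contrast concrete, the following minimal sketch illustrates the hard cut-off approach that CROP is designed to avoid; it is not CROP's Bayesian algorithm. It assumes pre-aligned reads of equal length, measures dissimilarity as the fraction of mismatching positions, and uses single-linkage grouping. The linkage choice, the function names (`dissimilarity`, `cluster_at_cutoff`, `mutate`) and the toy reads are all illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def dissimilarity(a: str, b: str) -> float:
    """Fraction of mismatching positions between two pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_at_cutoff(reads, cutoff):
    """Single-linkage grouping: join reads whose dissimilarity is <= cutoff."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(reads)), 2):
        if dissimilarity(reads[i], reads[j]) <= cutoff:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def mutate(seq, positions):
    """Return seq with a substituted base at each given position (toy data only)."""
    bases = list(seq)
    for p in positions:
        bases[p] = "C" if bases[p] == "A" else "A"
    return "".join(bases)


# Hypothetical 100-base reads: read 1 differs from read 0 by 2%,
# read 2 differs from read 0 by 4% and from read 1 by 6%.
base = "ACGT" * 25
reads = [base, mutate(base, [0, 10]), mutate(base, [20, 30, 40, 50])]

print(cluster_at_cutoff(reads, 0.03))  # [[0, 1], [2]]
print(cluster_at_cutoff(reads, 0.05))  # [[0, 1, 2]]
```

On this toy input the same three reads yield two OTUs at a 3% cut-off and one OTU at 5%, which shows how the estimated OTU count depends on a threshold chosen in advance.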
The source code is freely available at http://code.google.com/p/crop-tingchenlab/. CROP is implemented in C++ and supported on Linux and MS Windows.