Department of Biology and Centre for Geobiology, University of Bergen, Bergen, Norway.
PLoS One. 2012;7(11):e49334. doi: 10.1371/journal.pone.0049334. Epub 2012 Nov 8.
Sequencing of taxonomic or phylogenetic markers is becoming a fast and efficient method for studying environmental microbial communities. This has resulted in a steadily growing collection of marker sequences, most notably of the small-subunit (SSU) ribosomal RNA gene, and an increased understanding of microbial phylogeny, diversity and community composition patterns. However, to utilize these large datasets together with new sequencing technologies, a reliable and flexible system for taxonomic classification is critical. We developed CREST (Classification Resources for Environmental Sequence Tags), a set of resources and tools for generating and utilizing custom taxonomies and reference datasets for classification of environmental sequences. CREST uses an alignment-based classification method with the lowest common ancestor algorithm. It also uses explicit rank similarity criteria to reduce false positives and identify novel taxa. We implemented this method in a web server, a command line tool and the graphical user interfaced program MEGAN. Further, we provide the SSU rRNA reference database and taxonomy SilvaMod, derived from the publicly available SILVA SSURef, for classification of sequences from bacteria, archaea and eukaryotes. Using cross-validation and environmental datasets, we compared the performance of CREST and SilvaMod to the RDP Classifier. We also utilized Greengenes as a reference database, both with CREST and the RDP Classifier. These analyses indicate that CREST performs better than alignment-free methods with higher recall rate (sensitivity) as well as precision, and with the ability to accurately identify most sequences from novel taxa. Classification using SilvaMod performed better than with Greengenes, particularly when applied to environmental sequences. CREST is freely available under a GNU General Public License (v3) from http://apps.cbu.uib.no/crest and http://lcaclassifier.googlecode.com.
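The classification step described above (alignment hits reduced to a lowest common ancestor, then limited by rank similarity criteria) can be illustrated with a minimal Python sketch. This is not the CREST implementation: the rank list, the per-rank identity thresholds in `MIN_IDENTITY`, the `score_range` parameter, and the function names are all hypothetical choices made for illustration, and hits are assumed to arrive already parsed as (lineage, bit score, percent identity) tuples.

```python
from typing import List, Sequence, Tuple

# Hypothetical rank order and minimum percent identity required to assign a
# sequence at each rank. These values are illustrative assumptions only.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
MIN_IDENTITY = {
    "domain": 0.0, "phylum": 80.0, "class": 85.0, "order": 90.0,
    "family": 95.0, "genus": 97.0, "species": 99.0,
}

Lineage = Tuple[str, ...]           # taxon names from domain downward
Hit = Tuple[Lineage, float, float]  # (lineage, bit score, percent identity)


def lowest_common_ancestor(lineages: Sequence[Lineage]) -> Lineage:
    """Return the longest prefix shared by all lineages."""
    if not lineages:
        return ()
    common: List[str] = []
    for names in zip(*lineages):  # walk rank by rank, stopping at the shortest lineage
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return tuple(common)


def classify(hits: Sequence[Hit], score_range: float = 0.02) -> Lineage:
    """Assign a query to the LCA of its best hits, trimmed by rank thresholds.

    Hits scoring within `score_range` (a fraction) of the best bit score are
    kept. Their LCA is then shortened until the best percent identity among
    the kept hits satisfies the threshold of the deepest remaining rank.
    """
    if not hits:
        return ()
    best_score = max(score for _, score, _ in hits)
    cutoff = best_score * (1.0 - score_range)
    kept = [hit for hit in hits if hit[1] >= cutoff]

    lca = lowest_common_ancestor([hit[0] for hit in kept])
    best_identity = max(hit[2] for hit in kept)

    depth = len(lca)
    while depth > 0 and best_identity < MIN_IDENTITY[RANKS[depth - 1]]:
        depth -= 1  # too dissimilar for this rank: fall back to the parent taxon
    return lca[:depth]


if __name__ == "__main__":
    enterobacteriaceae = ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
                          "Enterobacterales", "Enterobacteriaceae")
    example_hits = [
        (enterobacteriaceae + ("Escherichia",), 500.0, 96.0),
        (enterobacteriaceae + ("Salmonella",), 495.0, 95.5),
        (("Bacteria", "Firmicutes", "Bacilli"), 300.0, 82.0),
    ]
    print(classify(example_hits))
```

In this sketch, trimming the assignment when similarity is too low for a given rank is what allows a sequence from a novel taxon to be reported at a higher rank instead of being forced into the nearest known genus or species.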