Rosen Gail, Garbarine Elaine, Caseiro Diamantino, Polikar Robi, Sokhansanj Bahrad
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA.
Adv Bioinformatics. 2008;2008:205969. doi: 10.1155/2008/205969. Epub 2008 Nov 16.
A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the massive amounts of metagenomic data, that is, all the DNA present in an environmental sample. A major obstacle in metagenomics is the difficulty of classifying sequences accurately when the sequencing technology yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes that were publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can identify the genome from which any fragment originates. We show that our method is comparable to BLAST for short 25 bp fragments but does not suffer from the ambiguity of BLAST's tied top scores. We demonstrate that this approach scales to identifying any fragment from among hundreds of genomes. It also performs well at the strain, species, and genus levels and achieves strain-level resolution even though it classifies fragments drawn from anywhere in the genome (gene and nongene regions). Cross-validation analysis shows that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be applied to the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
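The abstract describes training a naive Bayes classifier on per-genome N-mer frequency profiles and then assigning a fragment to the best-scoring genome. The following is a minimal sketch of that general idea, not the authors' implementation. The function names (`nmer_counts`, `train`, `classify`), the add-one smoothing, the uniform prior over genomes, the value N = 3, and the two toy "genomes" are all assumptions made for illustration.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping N-mers in a DNA sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train(genomes, n, pseudocount=1.0):
    """Build one smoothed log-probability profile per genome.

    Assumption: additive (Laplace) smoothing over the 4**n possible
    N-mers; the abstract does not specify the smoothing scheme.
    """
    vocab_size = 4 ** n
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        total = sum(counts.values()) + pseudocount * vocab_size
        log_probs = {
            nmer: math.log((count + pseudocount) / total)
            for nmer, count in counts.items()
        }
        unseen_log_prob = math.log(pseudocount / total)
        profiles[name] = (log_probs, unseen_log_prob)
    return profiles


def classify(fragment, profiles, n):
    """Return the best-scoring genome name and all log-likelihood scores.

    Assumption: a uniform prior over genomes, so the prior term is
    the same for every genome and is omitted from the score.
    """
    fragment_counts = nmer_counts(fragment, n)
    scores = {}
    for name, (log_probs, unseen_log_prob) in profiles.items():
        scores[name] = sum(
            count * log_probs.get(nmer, unseen_log_prob)
            for nmer, count in fragment_counts.items()
        )
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): two artificial "genomes" with disjoint 3-mers.
genomes = {
    "genome_a": "ACGT" * 50,
    "genome_b": "GGCC" * 50,
}
profiles = train(genomes, n=3)

fragment = "ACGT" * 6 + "A"  # a 25 bp read drawn from genome_a
best, scores = classify(fragment, profiles, n=3)
print(best)  # genome_a
```

Because every genome receives a real-valued log-likelihood, exact ties between candidates are unlikely in practice, which is one way to read the abstract's contrast with BLAST's tied top scores. Real genomes would need a larger N and handling of ambiguous bases and reverse complements, none of which this sketch attempts.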