Metagenome fragment classification using N-mer frequency profiles.

Author information

Rosen Gail, Garbarine Elaine, Caseiro Diamantino, Polikar Robi, Sokhansanj Bahrad

Affiliations

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA.

Publication information

Adv Bioinformatics. 2008;2008:205969. doi: 10.1155/2008/205969. Epub 2008 Nov 16.

Abstract

A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the large volume of metagenomic data, that is, all DNA present in an environmental sample. A major obstacle in metagenomics is the inability to obtain accurate classifications using technology that yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach scales to identifying any fragment from hundreds of genomes. It also performs well at the strain, species, and genus levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis shows that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
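The general idea of classifying a fragment from N-mer frequency profiles with a naive Bayes classifier can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy genomes, the function names, the add-one smoothing, and the uniform prior over genomes are all assumptions made for illustration, and the abstract does not state how the published method handles unseen N-mers or priors.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count every overlapping N-mer in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_profiles(genomes, n):
    """Build one log-frequency profile per genome.

    Add-one smoothing over the 4**n possible N-mers is an assumption
    of this sketch, not a detail taken from the abstract.
    """
    vocab_size = 4 ** n
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        total = sum(counts.values()) + vocab_size
        log_probs = {kmer: math.log((c + 1) / total) for kmer, c in counts.items()}
        unseen_log_prob = math.log(1 / total)
        profiles[name] = (log_probs, unseen_log_prob)
    return profiles


def classify(fragment, profiles, n):
    """Return the genome whose profile gives the fragment the highest
    summed N-mer log-likelihood (uniform prior assumed)."""
    frag_counts = nmer_counts(fragment, n)

    def score(name):
        log_probs, unseen = profiles[name]
        return sum(c * log_probs.get(kmer, unseen) for kmer, c in frag_counts.items())

    return max(profiles, key=score)


# Hypothetical toy "genomes"; real training data would be full microbial genomes.
genomes = {
    "genome_A": "ATATATATATATATATATATATATATATAT",
    "genome_B": "GCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
}
profiles = train_profiles(genomes, n=3)
label = classify("TATATATA", profiles, n=3)
print(label)
```

Because every genome receives a real-valued log-likelihood score, exact ties between top candidates are far less likely than with alignment scores, which is consistent with the abstract's remark about BLAST's tied top scores.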

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fa33/2777009/07d092b142a2/ABI2008-205969.001.jpg
