Metagenome fragment classification using N-mer frequency profiles.

Authors

Rosen Gail, Garbarine Elaine, Caseiro Diamantino, Polikar Robi, Sokhansanj Bahrad

Affiliation

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA.

Publication information

Adv Bioinformatics. 2008;2008:205969. doi: 10.1155/2008/205969. Epub 2008 Nov 16.

DOI:10.1155/2008/205969

PMID:19956701

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2777009/

Abstract

A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the large volumes of metagenomic data, that is, all DNA present in an environmental sample. A major obstacle in metagenomics is the difficulty of classifying fragments accurately when the sequencing technology yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the source genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach scales to identifying any fragment from hundreds of genomes. It also performs well at the strain, species, and genus levels, and achieves strain resolution even though it classifies fragments drawn from anywhere in the genome (gene and nongene regions). Cross-validation analysis demonstrates that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
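The classification idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (nmer_counts, train_profiles, classify), the add-one pseudocount smoothing, the uniform prior over genomes, and the toy sequences are all assumptions made for illustration. Each genome gets a profile of overlapping N-mer frequencies, and a fragment is assigned to the genome whose profile gives its N-mers the highest summed log-probability.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping N-mers in a DNA sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_profiles(genomes, n, pseudocount=1.0):
    """Build one smoothed log-frequency profile per genome.

    `genomes` maps a (hypothetical) genome name to its sequence.
    Smoothing with a pseudocount is an assumption of this sketch;
    it keeps unseen N-mers from producing log(0).
    """
    vocab_size = 4 ** n  # assumes sequences use only A, C, G, T
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        total = sum(counts.values()) + pseudocount * vocab_size
        profiles[name] = {
            "log_probs": {
                nmer: math.log((count + pseudocount) / total)
                for nmer, count in counts.items()
            },
            "log_unseen": math.log(pseudocount / total),
        }
    return profiles


def classify(fragment, profiles, n):
    """Return (best genome name, per-genome log scores) for a fragment.

    With a uniform prior (assumed here), the naive Bayes decision is the
    genome maximizing the sum of log-probabilities of the fragment's N-mers.
    """
    scores = {}
    for name, profile in profiles.items():
        log_probs = profile["log_probs"]
        log_unseen = profile["log_unseen"]
        scores[name] = sum(
            log_probs.get(fragment[i:i + n], log_unseen)
            for i in range(len(fragment) - n + 1)
        )
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with made-up sequences (not real genomes).
toy_genomes = {"genome_A": "ACGT" * 50, "genome_B": "GGCCTTAA" * 25}
toy_profiles = train_profiles(toy_genomes, n=3)
label, log_scores = classify("ACGTACGTACGTACGTACGTACGTA", toy_profiles, n=3)
```

Because every genome receives a real-valued score, exact ties are unlikely in practice, which is one way to read the abstract's contrast with BLAST's tied top scores. The N-mer length and the smoothing scheme used in the paper are not given in the abstract, so the values above are placeholders.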
