A novel abundance-based algorithm for binning metagenomic sequences using l-tuples.

Authors

Wu Yu-Wei, Ye Yuzhen

Affiliation

School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA.

Publication information

J Comput Biol. 2011 Mar;18(3):523-34. doi: 10.1089/cmb.2010.0245.

DOI:10.1089/cmb.2010.0245

PMID:21385052

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3123841/

Abstract

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on DNA composition patterns (e.g., tetramer frequencies) that differ between genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because DNA composition patterns vary substantially within a single genome. We developed a novel approach (AbundanceBin) for metagenomic binning that exploits the different abundances of species living in the same environment. AbundanceBin is an application of the Lander-Waterman model to metagenomics and is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances and of their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequences are very short (e.g., 75 bp) or contain sequencing errors. By combining AbundanceBin with a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
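
The abundance-based idea in the abstract can be illustrated with a minimal sketch. It assumes that the number of times an l-tuple occurs in the reads is roughly Poisson distributed, with a mean set by the abundance (coverage) of the genome it comes from, so that a mixture of Poissons fitted by expectation-maximization separates l-tuples, and then reads, by abundance. The function names, the EM details, and the read-assignment rule are illustrative assumptions and are not taken from the AbundanceBin implementation; the sketch also ignores genome-size estimation, sequencing errors, and the fact that l-tuples with zero count are never observed.

```python
import math
from collections import Counter


def count_ltuples(reads, l):
    """Count every overlapping l-tuple across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def poisson_logpmf(n, lam):
    """Log of the Poisson probability of observing count n given mean lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def fit_poisson_mixture(values, init_lams, iters=50):
    """Fit a Poisson mixture to l-tuple counts with EM (hypothetical helper).

    values: observed count of each distinct l-tuple.
    init_lams: starting guesses for the per-bin mean counts.
    Returns (lams, weights), one entry per bin.
    """
    lams = list(init_lams)
    weights = [1.0 / len(lams)] * len(lams)
    for _ in range(iters):
        # E-step: posterior probability that each l-tuple belongs to each bin.
        resp = []
        for n in values:
            logp = [math.log(w) + poisson_logpmf(n, lam)
                    for w, lam in zip(weights, lams)]
            top = max(logp)
            p = [math.exp(x - top) for x in logp]
            total = sum(p)
            resp.append([x / total for x in p])
        # M-step: re-estimate each bin's mean count and mixing weight.
        for k in range(len(lams)):
            rk = sum(r[k] for r in resp)
            if rk > 0:
                mean = sum(r[k] * n for r, n in zip(resp, values)) / rk
                lams[k] = max(mean, 1e-9)
            weights[k] = max(rk / len(values), 1e-12)
    return lams, weights


def assign_read(read, counts, lams, weights, l):
    """Assign a read to the bin with the highest summed log score
    over the counts of the l-tuples it contains."""
    scores = [0.0] * len(lams)
    for i in range(len(read) - l + 1):
        n = counts[read[i:i + l]]
        for k, (w, lam) in enumerate(zip(weights, lams)):
            scores[k] += math.log(w) + poisson_logpmf(n, lam)
    return max(range(len(lams)), key=lambda k: scores[k])


# Synthetic l-tuple counts from a low-abundance and a high-abundance species.
values = [1, 2, 3, 2, 1, 2, 3, 2] * 5 + [28, 30, 32, 30, 29, 31] * 5
lams, weights = fit_poisson_mixture(values, init_lams=[1.0, 20.0])
```

On real data the counts passed to `fit_poisson_mixture` would come from `count_ltuples(reads, l)`, and each read would then be binned with `assign_read`. The number of bins and the starting means are fixed by hand here, which is a simplification of this sketch.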
