A novel abundance-based algorithm for binning metagenomic sequences using l-tuples.

Authors

Wu Yu-Wei, Ye Yuzhen

Affiliation

School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA.

Publication

J Comput Biol. 2011 Mar;18(3):523-34. doi: 10.1089/cmb.2010.0245.

Abstract

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on the DNA composition patterns (e.g., tetramer frequencies) of the source genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because DNA composition patterns vary substantially within a single genome. We developed a novel approach (AbundanceBin) for metagenomic binning that exploits the different abundances of species living in the same environment. AbundanceBin applies the Lander-Waterman model to metagenomics, using the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances as well as their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequences are very short (e.g., 75 bp) or contain sequencing errors. By combining AbundanceBin with a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
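The idea of binning reads by the abundance of their l-tuples can be illustrated with a short sketch. The code below is not the AbundanceBin implementation. It assumes that, under a Lander-Waterman-style model, the number of times an l-tuple occurs across all reads is roughly Poisson-distributed with a mean that depends on the abundance of its source species, and it fits a mixture of Poisson distributions with a plain EM loop. All function names (`count_ltuples`, `fit_poisson_mixture`, `assign_read`), the initial means, and the iteration count are hypothetical choices made for illustration. The sketch also ignores the fact that only l-tuples seen at least once are observed.

```python
import math
from collections import Counter


def count_ltuples(reads, l):
    """Count every overlapping l-tuple across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def log_poisson(n, lam):
    """Log-probability of observing count n under a Poisson with mean lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def fit_poisson_mixture(values, init_lams, iters=50):
    """Fit Poisson means and mixing weights to l-tuple counts by EM.

    Assumption: one mixture component per abundance bin.
    """
    lams = list(init_lams)
    weights = [1.0 / len(lams)] * len(lams)
    for _ in range(iters):
        # E-step: responsibility of each bin for each observed count.
        resp = []
        for n in values:
            logs = [math.log(w) + log_poisson(n, lam)
                    for w, lam in zip(weights, lams)]
            top = max(logs)
            probs = [math.exp(x - top) for x in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate each bin's mean count and weight.
        for j in range(len(lams)):
            rj = sum(r[j] for r in resp)
            if rj > 0:
                mean = sum(r[j] * n for r, n in zip(resp, values)) / rj
                lams[j] = max(mean, 1e-9)
                weights[j] = max(rj / len(values), 1e-12)
    return lams, weights


def assign_read(read, l, counts, lams):
    """Assign a read to the bin whose mean best explains its l-tuple counts."""
    scores = [0.0] * len(lams)
    for i in range(len(read) - l + 1):
        n = counts[read[i:i + l]]
        for j, lam in enumerate(lams):
            scores[j] += log_poisson(n, lam)
    return scores.index(max(scores))
```

In this sketch, the fitted means play the role of per-bin abundance estimates, and a read is placed in the bin under which the counts of its own l-tuples are most probable. How the published method initializes the bins, chooses their number, and derives genome sizes is not described in the abstract and is not reproduced here.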


