Separating metagenomic short reads into genomes via clustering.

Authors

Tanaseichuk Olga, Borneman James, Jiang Tao

Affiliation

Department of Computer Science and Engineering, University of California, Riverside, CA, USA.

Publication information

Algorithms Mol Biol. 2012 Sep 26;7(1):27. doi: 10.1186/1748-7188-7-27.

DOI: 10.1186/1748-7188-7-27

PMID: 23009059

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3537596/

Abstract

BACKGROUND

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and reduce cost. On the other hand, they produce shorter reads, which makes it harder to separate reads from different species. Existing computational tools for metagenomic analysis include similarity-based methods, which align reads against reference databases, and composition-based methods, which cluster reads by composition patterns (i.e., frequencies of short words, or l-mers). Similarity-based methods cannot classify reads from unknown species that lack close references, and such reads constitute the majority. Since composition patterns are preserved only in sufficiently large fragments, composition-based tools cannot be used for very short reads, which has become a significant limitation as NGS has developed. A recently proposed algorithm, AbundanceBin, introduced another approach that bins reads based on the predicted abundances of the sequenced genomes. However, it does not separate reads from genomes with similar abundance levels.
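To make the notion of a composition pattern concrete, here is a minimal Python sketch (not taken from the paper) that turns a sequence into a normalized l-mer frequency vector. The function names `composition_vector` and `nonzero_entries` and the default l = 4 are illustrative assumptions. The sketch also hints at why very short reads are a problem: a 75 bp read contains at most 72 4-mers, so most of the 256 entries of its vector are necessarily zero.

```python
from collections import Counter
from itertools import product


def composition_vector(seq, l=4):
    """Normalized frequencies of all 4**l DNA words of length l in seq.

    Words containing characters other than A, C, G, T are ignored.
    """
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    words = ["".join(p) for p in product("ACGT", repeat=l)]
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def nonzero_entries(vec):
    """Number of words that actually occur in the sequence."""
    return sum(1 for x in vec if x > 0)


# A hypothetical 75 bp read: its 4-mer vector has 256 entries,
# but no more than 72 of them can be nonzero.
read = "ACGT" * 18 + "ACG"
vec = composition_vector(read, l=4)
print(len(vec), nonzero_entries(vec))
```

Composition-based tools compare such vectors between fragments; the shorter the fragment, the sparser and noisier the vector becomes.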

RESULTS

In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. The final clusters are then used to assign reads. The algorithm can handle very short reads and sequencing errors. It was initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.
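The observation underlying the method can be illustrated with a small Python sketch. This is not the authors' algorithm: the l-mer clusters here are built directly from two known toy genomes, whereas the paper's first phase has to derive them from reads alone. The random genomes, the names `lmers`, `unique_fraction` and `assign_read`, and the simple majority-overlap rule for assigning reads are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def lmers(seq, l):
    """Set of all length-l substrings of seq (empty if seq is shorter than l)."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def unique_fraction(genomes, l):
    """Fraction of distinct l-mers that occur in exactly one genome."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for word in lmers(seq, l):
            owners[word].add(name)
    if not owners:
        return 0.0
    unique = sum(1 for names in owners.values() if len(names) == 1)
    return unique / len(owners)


def assign_read(read, clusters, l):
    """Assign a read to the cluster sharing the most l-mers with it.

    Returns None when the read shares no l-mer with any cluster.
    """
    words = lmers(read, l)
    best, best_hits = None, 0
    for name, cluster in clusters.items():
        hits = len(words & cluster)
        if hits > best_hits:
            best, best_hits = name, hits
    return best


# Two toy "genomes" of random sequence (an assumption, not real data).
rng = random.Random(0)
genomes = {
    name: "".join(rng.choice("ACGT") for _ in range(5000))
    for name in ("A", "B")
}

# Short words are shared by both genomes; long words are genome-specific.
for l in (4, 8, 12, 16):
    print(l, round(unique_fraction(genomes, l), 3))

# Idealized clusters: one set of 16-mers per genome.
clusters = {name: lmers(seq, 16) for name, seq in genomes.items()}
read = genomes["B"][100:175]  # a 75 bp error-free read from genome B
print(assign_read(read, clusters, 16))
```

Real genomes contain repeats and shared regions, so the unique fraction rises more slowly with l than it does for random sequence; the second, merging phase described above relies on that repeat information.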

CONCLUSIONS

Our tests on a large number of simulated metagenomic datasets, covering species at various phylogenetic distances, demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
