Separating metagenomic short reads into genomes via clustering.
Author information
Tanaseichuk Olga, Borneman James, Jiang Tao
Affiliation
Department of Computer Science and Engineering, University of California, Riverside, CA, USA.
Publication information
Algorithms Mol Biol. 2012 Sep 26;7(1):27. doi: 10.1186/1748-7188-7-27.
BACKGROUND
The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in highly complex datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and reduce cost. On the other hand, they produce shorter reads, which makes it harder to separate reads from different species. Among the existing computational tools for metagenomic analysis, there are similarity-based methods, which align reads against reference databases, and composition-based methods, which cluster reads by their composition patterns (i.e., the frequencies of short words, or l-mers). Similarity-based methods cannot classify reads from unknown species that lack close references, and such reads constitute the majority. Since composition patterns are preserved only in sufficiently long fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation as NGS develops. A recently proposed algorithm, AbundanceBin, takes a different approach and bins reads based on the predicted abundances of the sequenced genomes. However, it does not separate reads from genomes with similar abundance levels.
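To make the notion of a composition pattern concrete, the following minimal Python sketch computes a normalized l-mer frequency profile for a single sequence. The function name `lmer_frequencies` and the toy sequence are illustrative assumptions and are not taken from any of the tools mentioned above.

```python
from collections import Counter
from itertools import product


def lmer_frequencies(seq, l):
    """Return a dict mapping every DNA word of length l to its relative
    frequency among the overlapping l-mers of seq."""
    # Slide a window of width l over the sequence and count each word.
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than l")
    # Report all 4**l possible words so that profiles are comparable.
    words = ("".join(w) for w in product("ACGT", repeat=l))
    return {w: counts[w] / total for w in words}


# Toy sequence (hypothetical data, for illustration only).
profile = lmer_frequencies("ACGTACGTAC", 2)
print(profile["AC"], profile["AA"])
```

A hypothetical 75 bp read contains only 72 overlapping 4-mers spread over 256 possible words, so most entries of its profile are zero. This illustrates why composition signals become unreliable on very short reads.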
RESULTS
In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most l-mers belong to a single genome when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. The final clusters are then used to assign reads. The algorithm can handle very short reads and sequencing errors. It was initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.
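The observation about l-mers can be checked on toy data. The sketch below measures, for a given l, the fraction of distinct l-mers that occur in exactly one genome of a collection. It is only an illustration of the observation, not the authors' clustering algorithm; the helper names and the two toy genomes are hypothetical.

```python
from collections import Counter


def lmer_set(genome, l):
    """Set of distinct overlapping l-mers in one genome string."""
    return {genome[i:i + l] for i in range(len(genome) - l + 1)}


def genome_specific_fraction(genomes, l):
    """Fraction of distinct l-mers that occur in exactly one genome."""
    owners = Counter()  # l-mer -> number of genomes that contain it
    for genome in genomes:
        owners.update(lmer_set(genome, l))
    if not owners:
        return 0.0
    specific = sum(1 for n in owners.values() if n == 1)
    return specific / len(owners)


# Two toy "genomes" (hypothetical data, far shorter than real genomes).
g1 = "ACGTTGCA"
g2 = "TGCAACGT"
for l in (1, 4, 8):
    print(l, genome_specific_fraction([g1, g2], l))
```

On this toy input the fraction rises from 0.0 at l = 1 to 0.75 at l = 4 and 1.0 at l = 8, which mirrors the trend the algorithm relies on: longer words are increasingly likely to identify a single genome.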
CONCLUSIONS
Our tests on a large number of simulated metagenomic datasets covering species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
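As a rough illustration of how precision and sensitivity can be computed for a read clustering on simulated data, the sketch below uses one convention that is common in binning evaluations: each cluster is credited with its dominant genome, and each genome with its best cluster. This convention is an assumption and is not necessarily the exact definition used by the authors; the function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def precision_sensitivity(true_genome, predicted_cluster):
    """true_genome: one genome label per read.
    predicted_cluster: one cluster id per read, or None if unassigned."""
    table = defaultdict(Counter)  # cluster id -> genome label -> read count
    for genome, cluster in zip(true_genome, predicted_cluster):
        if cluster is not None:
            table[cluster][genome] += 1
    assigned = sum(sum(c.values()) for c in table.values())
    if assigned == 0:
        return 0.0, 0.0
    # Precision: reads that agree with their cluster's dominant genome,
    # divided by the number of assigned reads.
    precision = sum(max(c.values()) for c in table.values()) / assigned
    # Sensitivity: for each genome, the reads captured by its best cluster,
    # divided by all reads (unassigned reads count against sensitivity).
    captured = sum(max(c[g] for c in table.values()) for g in set(true_genome))
    sensitivity = captured / len(true_genome)
    return precision, sensitivity


# Toy example: six reads from two genomes, with one read left unassigned.
truth = ["A", "A", "A", "B", "B", "B"]
pred = [0, 0, 1, 1, 1, None]
print(precision_sensitivity(truth, pred))
```

In the toy example, four of the five assigned reads match the dominant genome of their cluster (precision 0.8), and four of the six reads fall into the best cluster for their genome (sensitivity about 0.67).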