Pevzner P A, Tang H
Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA.
Bioinformatics. 2001;17 Suppl 1:S225-33. doi: 10.1093/bioinformatics/17.suppl_1.s225.
For the last twenty years, fragment assembly was dominated by the "overlap - layout - consensus" algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing, and it is not clear whether they are the best choice for large-scale assemblies. Although the "overlap - layout - consensus" approach proved useful in assembling clones, it faces difficulties in genomic assemblies: the existing algorithms make assembly errors even in bacterial genomes. We abandoned the "overlap - layout - consensus" approach in favour of a new Eulerian Superpath approach that outperforms the existing algorithms for genomic fragment assembly (Pevzner et al. 2001, in Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB-01), 256-26). In this paper we describe our new EULER-DB algorithm that, like the Celera assembler, takes advantage of clone-end sequencing by using double-barreled data. However, in contrast to the Celera assembler, EULER-DB does not mask repeats but instead uses them as a powerful tool for contig ordering. We also describe a new approach to the Copy Number Problem: "How many times is a given repeat present in the genome?" For long, nearly perfect repeats this question is notoriously difficult, and some copies of such repeats may be "lost" in genomic assemblies. We describe our EULER-CN algorithm for the Copy Number Problem, which proved successful in difficult sequencing projects.