Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and LIRMM and Institut de Biologie Computationnelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France.
Bioinformatics. 2014 Dec 15;30(24):3506-14. doi: 10.1093/bioinformatics/btu538. Epub 2014 Aug 26.
PacBio single-molecule real-time sequencing is a third-generation sequencing technique that produces long reads, with comparatively lower throughput and a higher error rate. The errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space.
We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. Compared with available tools, LoRDEC is at least six times faster and requires at least 93% less memory or disk space, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.
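The idea of correcting an erroneous region by traversing paths in a de Bruijn graph of the short reads can be illustrated with a small sketch. This is not LoRDEC's implementation: a plain Python set stands in for the succinct graph, the function names (`build_kmer_set`, `best_bridge`, `correct_long_read`) and the parameters `min_count` and `max_extra` are hypothetical, and the path search is an exhaustive bounded depth-first search chosen for brevity. Regions whose flanking k-mers overlap are simply left uncorrected here.

```python
from collections import Counter


def build_kmer_set(short_reads, k, min_count=1):
    """Collect k-mers seen at least min_count times in the short reads.
    The set is an implicit de Bruijn graph: two k-mers are linked when
    they overlap by k-1 bases."""
    counts = Counter(r[i:i + k] for r in short_reads
                     for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_bridge(kmers, source, target, weak, max_extra=3):
    """Search graph paths from source to target and return the inner
    sequence closest (in edit distance) to the weak region, or None."""
    k = len(source)
    limit = len(weak) + max_extra + k  # bound on appended bases
    best, best_dist = None, None
    stack = [(source, "")]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) >= k:
            inner = path[:-k]  # bases between the two anchor k-mers
            dist = edit_distance(inner, weak)
            if best_dist is None or dist < best_dist:
                best, best_dist = inner, dist
            continue
        if len(path) >= limit:
            continue
        for base in "ACGT":
            nxt = node[1:] + base
            if nxt in kmers:
                stack.append((nxt, path + base))
    return best


def correct_long_read(read, kmers, k, max_extra=3):
    """Replace each region between non-overlapping solid k-mers with the
    best-matching path sequence found in the graph."""
    solid = [i for i in range(len(read) - k + 1) if read[i:i + k] in kmers]
    if not solid:
        return read
    out = read[:solid[0] + k]
    for p, q in zip(solid, solid[1:]):
        if q - p < k:
            # Adjacent or overlapping anchors: copy the read as-is.
            out += read[p + k:q + k]
            continue
        weak = read[p + k:q]
        bridge = best_bridge(kmers, read[p:p + k], read[q:q + k],
                             weak, max_extra)
        out += (weak if bridge is None else bridge) + read[q:q + k]
    return out + read[solid[-1] + k:]
```

In this sketch, k-mers of the long read that also occur in the short reads act as anchors, and only the stretches between them are rewritten. A real tool would need a memory-efficient graph representation and a search that limits the number of explored paths, which the exhaustive search above does not attempt.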