Salmela Leena, Walve Riku, Rivals Eric, Ukkonen Esko
Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.
LIRMM and Institut de Biologie Computationnelle, CNRS and Université Montpellier, Montpellier, France.
Bioinformatics. 2017 Mar 15;33(6):799-806. doi: 10.1093/bioinformatics/btw321.
New long-read sequencing technologies, such as PacBio SMRT and Oxford Nanopore, can produce sequencing reads up to 50 000 bp long, but with an error rate of at least 15%. Reducing the error rate is necessary before the reads can be used in downstream applications such as de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second-generation sequencing technologies to correct the long reads.
We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing k-mer length; second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.
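The first phase can be illustrated with a toy sketch. The code below is not LoRMA's implementation: it is a deliberately simplified stand-in that only repairs substitution errors, using a de Bruijn graph built from "solid" (frequently occurring) k-mers and repeating the pass with a larger k. All function names, the k values (4, then 6), the solidity threshold `min_count`, and the example sequences are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(reads, k, min_count):
    """Build a de Bruijn graph from solid k-mers.

    Nodes are (k-1)-mers; each solid k-mer is an edge from its
    (k-1)-prefix to its (k-1)-suffix.
    """
    counts = count_kmers(reads, k)
    solid = {kmer for kmer, c in counts.items() if c >= min_count}
    graph = {}
    for kmer in solid:
        graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return solid, graph


def correct_read(read, k, solid, graph):
    """Walk the read left to right; when a k-mer is not solid and the
    graph offers exactly one outgoing edge from its prefix node,
    overwrite the last base of the k-mer with that edge's base."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if kmer in solid:
            continue
        successors = graph.get(kmer[:-1], set())
        if len(successors) == 1:
            bases[i + k - 1] = next(iter(successors))[-1]
    return "".join(bases)


def iterative_correct(reads, ks=(4, 6), min_count=2):
    """Repeat the correction with increasing k, rebuilding the graph
    from the reads corrected in the previous round."""
    for k in ks:
        solid, graph = build_graph(reads, k, min_count)
        reads = [correct_read(r, k, solid, graph) for r in reads]
    return reads


# Hypothetical toy data: four error-free copies and one read with a
# single substitution (G -> T at 0-based position 10).
genome = "ACGTTGCATGGACCTAGCTA"
noisy = "ACGTTGCATGTACCTAGCTA"
corrected = iterative_correct([genome] * 4 + [noisy])
```

In this toy setting the erroneous k-mers each occur once and fall below the threshold, so the walk replaces the mismatching base with the only extension the graph supports. Starting with a small k matters at high error rates because short k-mers are more likely to be error-free; larger k values in later rounds then help resolve repeats that short k-mers cannot distinguish. Insertions, deletions, and the second (multiple-alignment) phase are outside the scope of this sketch.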
LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/ .