Accurate self-correction of errors in long reads using de Bruijn graphs.

Author information

Salmela Leena, Walve Riku, Rivals Eric, Ukkonen Esko

Affiliations

Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.

LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier, France.

Publication information

Bioinformatics. 2017 Mar 15;33(6):799-806. doi: 10.1093/bioinformatics/btw321.

Abstract

MOTIVATION

New long-read sequencing technologies, such as PacBio SMRT and Oxford Nanopore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent use of the reads in, e.g., de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second-generation sequencing technologies to correct the long reads.

RESULTS

We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.
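The first phase can be illustrated with a toy sketch. It is not the LoRMA implementation: it assumes a k-mer is "solid" when it occurs at least `min_count` times, ignores reverse complements, and bridges each weak region by a bounded path search between the flanking solid k-mers, choosing the path whose length is closest to the original gap. All function names, the k schedule `(4, 6, 8)`, and the parameters `min_count`, `slack`, and `max_paths` are hypothetical choices for illustration. The second phase (polishing with multiple alignments) is not covered.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Return the k-mers seen at least min_count times (assumed 'solid' rule)."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


def find_paths(solid, src, dst, max_len, max_paths=100):
    """Enumerate de Bruijn graph paths from src to dst.

    Nodes are solid k-mers; an edge joins two k-mers that overlap by k-1 bases.
    Each result is the string of bases appended after src. The search is only
    bounded by max_len and max_paths, which is adequate for a toy example.
    """
    paths, stack = [], [(src, "")]
    while stack and len(paths) < max_paths:
        node, ext = stack.pop()
        if ext and node == dst:
            paths.append(ext)
        if len(ext) < max_len:
            for base in "ACGT":
                nxt = node[1:] + base
                if nxt in solid:
                    stack.append((nxt, ext + base))
    return paths


def correct_read(read, solid, k, slack=2):
    """Replace weak regions between consecutive solid k-mers with a graph path."""
    pos = [i for i in range(len(read) - k + 1) if read[i:i + k] in solid]
    if not pos:
        return read  # no solid anchor: leave the read untouched
    out = read[:pos[0] + k]
    for i, j in zip(pos, pos[1:]):
        segment = read[i + k:j + k]  # bases the read appends between anchors
        if j - i > 1:  # at least one weak k-mer lies between the anchors
            paths = find_paths(solid, read[i:i + k], read[j:j + k], (j - i) + slack)
            if paths:
                segment = min(paths, key=lambda p: abs(len(p) - (j - i)))
        out += segment
    return out + read[pos[-1] + k:]


def iterative_correct(reads, ks=(4, 6, 8), min_count=2):
    """Run one correction round per k, with k increasing between rounds."""
    for k in ks:
        solid = solid_kmers(reads, k, min_count)
        reads = [correct_read(r, solid, k) for r in reads]
    return reads


if __name__ == "__main__":
    truth = "ACGTTGCATGGACCTAGT"
    noisy = "ACGTTGCACGGACCTAGT"  # one substitution at position 8
    print(iterative_correct([truth, truth, truth, noisy]))
```

Starting with a small k lets short error-free stretches of a noisy read still match solid k-mers. Later rounds with larger k work on partly corrected reads, where longer k-mers are more likely to be error-free and resolve repeats better.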

AVAILABILITY AND IMPLEMENTATION

LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/.

CONTACT

leena.salmela@cs.helsinki.fi.
