CoLoRMap: Correcting Long Reads by Mapping short reads.

Author information

Haghshenas Ehsan, Hach Faraz, Sahinalp S Cenk, Chauve Cedric

Affiliations

School of Computing Sciences MADD-Gen Graduate Program, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.

School of Computing Sciences Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada.

Publication information

Bioinformatics. 2016 Sep 1;32(17):i545-i551. doi: 10.1093/bioinformatics/btw463.

DOI:10.1093/bioinformatics/btw463

PMID:27587673

Abstract

MOTIVATION

Second-generation sequencing technologies paved the way for an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. So far, however, long reads are characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

RESULTS

We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: (i) using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and (ii) extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
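
The shortest-path idea can be illustrated with a minimal sketch. This is not CoLoRMap's implementation: the abstract does not specify the graph construction or edge weights, so the alignment record `(start, end, edit_score)`, the function name `cheapest_chain`, and the choice to charge each short read its full edit score are all assumptions made for illustration. Each short-read alignment is a node, an edge joins two alignments that overlap on the long read, and Dijkstra's algorithm picks the chain with the smallest total edit score that spans a region.

```python
import heapq
from typing import List, Optional, Tuple

# Hypothetical record: (start, end, edit_score) of a short read aligned to
# the long read, in long-read coordinates with an exclusive end.
Alignment = Tuple[int, int, int]


def cheapest_chain(
    alns: List[Alignment], region_start: int, region_end: int
) -> Optional[Tuple[int, List[int]]]:
    """Find the chain of overlapping alignments that spans
    [region_start, region_end) with the smallest summed edit score.

    Returns (total_score, alignment_indices), or None if no chain spans
    the region. Assumed cost model: each alignment in the chain
    contributes its whole edit score.
    """
    n = len(alns)
    dist: List[Optional[int]] = [None] * n
    prev: List[Optional[int]] = [None] * n
    heap: List[Tuple[int, int]] = []

    # Any alignment covering region_start can begin a chain.
    for i, (s, e, cost) in enumerate(alns):
        if s <= region_start < e:
            dist[i] = cost
            heapq.heappush(heap, (cost, i))

    while heap:
        d, i = heapq.heappop(heap)
        if d != dist[i]:
            continue  # stale heap entry
        s_i, e_i, _ = alns[i]
        if e_i >= region_end:
            # First finished node that reaches the region end is optimal.
            path = []
            cur: Optional[int] = i
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return d, path[::-1]
        for j, (s_j, e_j, c_j) in enumerate(alns):
            # Edge i -> j: j starts inside i and extends past its end.
            if s_i <= s_j < e_i and e_j > e_i:
                nd = d + c_j
                if dist[j] is None or nd < dist[j]:
                    dist[j] = nd
                    prev[j] = i
                    heapq.heappush(heap, (nd, j))
    return None
```

For example, with alignments `[(0, 100, 5), (50, 150, 1), (60, 160, 4), (140, 250, 2), (155, 250, 0)]` and the region `[0, 250)`, the sketch returns `(8, [0, 1, 3])`: the chain through alignments 0, 1 and 3 costs 5 + 1 + 2, which beats the alternative chain 0, 2, 4 at 5 + 4 + 0 = 9. Edge weights are non-negative, so Dijkstra's algorithm is sufficient here.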

AVAILABILITY AND IMPLEMENTATION

The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap

CONTACT

ehaghshe@sfu.ca or cedric.chauve@sfu.ca

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
