Turki Turki, Roshan Usman
Computer Science Department, King Abdulaziz University, P,O, Box 80221 Jeddah, Saudi Arabia.
BMC Genomics. 2014 Nov 15;15(1):969. doi: 10.1186/1471-2164-15-969.
Programs based on hash tables and Burrows-Wheeler are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm but it can take hours and days to map millions of reads even for bacteria genomes.
We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most programs MaxSSmap identifies a local region of the genome followed by exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E.coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E.coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes.
We show that MaxSSmap attains comparable high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. On short read lengths of 36 and 51 both MaxSSmap and Smith-Waterman have lower accuracy compared to at higher lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code in CUDA and OpenCL is freely available from http://www.cs.njit.edu/usman/MaxSSmap.
基于哈希表和布罗-惠勒变换(Burrows-Wheeler)的程序在将短读段映射到基因组时速度非常快,但在存在错配和缺口的情况下准确性较低。这样的读段可以使用史密斯-沃特曼算法(Smith-Waterman)进行精确比对,但即使对于细菌基因组,映射数百万个读段也可能需要数小时甚至数天。
我们引入了一个名为MaxSSmap的GPU程序,目的是实现与史密斯-沃特曼算法相当的准确性,但运行时间更快。与大多数程序类似,MaxSSmap先识别基因组的一个局部区域,然后进行精确比对。在第一部分中,MaxSSmap不是使用哈希表或布罗-惠勒变换,而是在GPU上并行计算读段与基因组不相交片段之间的最大得分子序列得分,并选择得分最高的片段进行精确比对。我们评估了MaxSSmap在将不同长度且有10%至30%错配和缺口的模拟Illumina大肠杆菌和人类1号染色体读段映射到大肠杆菌基因组和人类1号染色体时的准确性和运行时间。我们还通过将古代马DNA读段映射到现代基因组以及1000基因组计划中NA12878的未映射配对读段展示了在真实数据上的应用。
我们表明MaxSSmap实现了与快速史密斯-沃特曼程序相当的高精度和低错误率,同时运行时间要低得多。我们表明MaxSSmap能够以比使用史密斯-沃特曼算法快得多的速度,高精度、低错误地映射被BWA和NextGenMap拒绝的读段。在36和51的短读长情况下,与较长读长相比,MaxSSmap和史密斯-沃特曼算法的准确性都较低。在真实数据上,MaxSSmap产生了许多高分且映射质量高的比对结果,而NextGenMap和BWA并未给出这些结果。MaxSSmap的CUDA和OpenCL源代码可从http://www.cs.njit.edu/usman/MaxSSmap免费获取。