Kalikar Saurabh, Jain Chirag, Vasimuddin Md, Misra Sanchit
Intel Labs, Bangalore, India.
Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India.
Nat Comput Sci. 2022 Feb;2(2):78-83. doi: 10.1038/s43588-022-00201-8. Epub 2022 Feb 28.
Long-read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here we present techniques to accelerate minimap2, a widely used software tool for this task. We present multiple optimizations using single-instruction multiple-data parallelization, efficient cache utilization and a learned index data structure to accelerate the three main computational modules of minimap2: seeding, chaining and pairwise sequence alignment. These optimizations yield up to a 1.8-fold reduction in the end-to-end mapping time of minimap2 while maintaining identical output.
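To make the term "learned index" concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a single least-squares linear model over a sorted list of integer keys. The class name `LinearLearnedIndex` and its methods are hypothetical, and the abstract does not say which model or search strategy the accelerated minimap2 uses. The model predicts where a key should sit, and a binary search bounded by the model's worst observed error corrects the prediction.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index (illustrative only, not minimap2's data structure).

    A linear model maps a key to an approximate position in a sorted list;
    a binary search restricted to the model's worst-case error window then
    finds the exact position.
    """

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")

        # Least-squares fit of position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(self.keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key

        # Largest distance between a true position and its prediction;
        # this bounds the search window used at lookup time.
        self.max_err = max(abs(i - self._predict(k))
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        # Round the model output and clamp it to a valid position.
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the index of `key`, or None if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

In this sketch the payoff comes from the error window: when the keys are close to linearly distributed, `max_err` is small and the search touches only a few neighboring entries, which is friendlier to the cache than a binary search over the whole array.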