Wei Ze-Gang, Fan Xing-Guo, Zhang Hao, Zhang Xiao-Dan, Liu Fei, Qian Yu, Zhang Shao-Wu
Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China.
Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China.
Front Genet. 2022 May 5;13:890651. doi: 10.3389/fgene.2022.890651. eCollection 2022.
With the rapid development of single-molecule sequencing (SMS) technologies such as PacBio single-molecule real-time and Oxford Nanopore sequencing, output read lengths are continuously increasing, which offers great potential for cutting-edge genomic applications. Mapping these reads to a reference genome is often the most fundamental and compute-intensive step for downstream analysis. However, these long reads have higher sequencing error rates and span the breakpoints of structural variants (SVs) more frequently than shorter reads do, so most state-of-the-art mappers leave many reads unaligned or only partially aligned. As a result, these methods usually focus on producing local mapping results for the query read rather than obtaining a whole end-to-end alignment. We introduce kngMap, a novel k-mer neighborhood graph-based mapper that is specifically designed to align long, noisy SMS reads to a reference sequence. In extensive benchmarks on both simulated and real SMS datasets, comparing kngMap with ten other popular SMS mapping tools (e.g., BLASR, BWA-MEM, and minimap2), we demonstrated that kngMap has higher sensitivity and aligns more reads and bases to the reference genome; in addition, kngMap can produce consecutive alignments for the whole read and span different categories of SVs in the reads. kngMap is implemented in C++ and supports multi-threading; the source code is freely available for academic use at https://github.com/zhang134/kngMap.