Joshi Dhaivat, Diggavi Suhas, Chaisson Mark J P, Kannan Sreeram
University of California, Los Angeles.
Department of Quantitative and Computational Biology, University of Southern California, Los Angeles.
bioRxiv. 2023 Jan 9:2023.01.08.523172. doi: 10.1101/2023.01.08.523172.
Detection of structural variants (SVs) from the alignment of sample DNA reads to a reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, together with accurate alignment of these reads, play an important role in identifying novel SVs. Long-read technologies such as nanopore sequencing can address this problem by providing very long reads, but their high error rates make accurate alignment challenging. Many of the errors induced by nanopore sequencing are biased because of the physics of the sequencing process, and proper use of these error characteristics can play an important role in designing a robust aligner for SV detection. In this paper, we design and evaluate HQAlign, an aligner for SV detection using nanopore-sequenced reads. The key ideas of HQAlign are (i) using basecalled nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes into the alignment pipeline, and (iii) adapting these into the pipeline of an existing state-of-the-art long-read aligner, minimap2 (v2.24), for efficient alignment.
We show that HQAlign captures about 4-6% complementary SVs across different datasets that are missed by minimap2 alignments, while its standalone performance is on par with minimap2 on real nanopore read data. For the SV calls common to HQAlign and minimap2, HQAlign improves the start and end breakpoint accuracy for about 10-50% of SVs across different datasets. Moreover, HQAlign improves the alignment rate from 85.64% (minimap2) to 89.35% for nanopore reads aligned to the recent telomere-to-telomere CHM13 assembly, and from 83.48% to 86.65% for nanopore reads aligned to the GRCh37 human genome.
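To put the reported alignment rates in perspective, the short sketch below works out the gain in percentage points and the relative gain over the minimap2 baseline. It uses only the four rates quoted above; the function names and the dictionary layout are illustrative assumptions and are not part of HQAlign or minimap2.

```python
# Illustrative arithmetic on the alignment rates quoted in the abstract.
# Function names and data layout are hypothetical, not from HQAlign.

def percentage_point_gain(baseline: float, improved: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(improved - baseline, 2)


def relative_gain(baseline: float, improved: float) -> float:
    """Gain relative to the baseline rate, in percent, rounded to two decimals."""
    return round(100.0 * (improved - baseline) / baseline, 2)


# (minimap2 rate, HQAlign rate) in percent, as reported in the abstract.
ALIGNMENT_RATES = {
    "T2T CHM13": (85.64, 89.35),
    "GRCh37": (83.48, 86.65),
}

if __name__ == "__main__":
    for reference, (minimap2_rate, hqalign_rate) in ALIGNMENT_RATES.items():
        points = percentage_point_gain(minimap2_rate, hqalign_rate)
        relative = relative_gain(minimap2_rate, hqalign_rate)
        print(f"{reference}: +{points} percentage points ({relative}% relative)")
```

Under these figures the gain is 3.71 percentage points on CHM13 and 3.17 percentage points on GRCh37.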