Department of Computational and Data Sciences, Indian Institute of Science, Bangalore Karnataka 560012, India.
Department of Computer Science, The University of Texas at Dallas, Richardson, Texas 75080, USA.
Genome Res. 2024 Oct 11;34(9):1265-1275. doi: 10.1101/gr.279143.124.
Modern pangenome graphs are built using haplotype-resolved genome assemblies. When mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes improves genotyping accuracy. However, the existing rigorous formulations for colinear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. In this paper, we develop novel formulations and algorithms for sequence-to-graph alignment and chaining problems. Inspired by the genotype imputation models, we assume that a query sequence is an imperfect mosaic of reference haplotypes. Accordingly, we introduce a recombination penalty in the scoring functions for each haplotype switch. First, we solve haplotype-aware sequence-to-graph alignment in [Formula: see text] time, where is the query sequence, is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than [Formula: see text] is impossible under the strong exponential time hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in [Formula: see text] time after graph preprocessing, where is the count of input anchors. We then establish that a chaining algorithm significantly faster than [Formula: see text] is impossible under SETH. As a proof-of-concept, we implemented our chaining algorithm in the Minichain aligner. By aligning sequences sampled from the human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes, we demonstrate that our algorithm achieves better consistency with ground-truth recombinations compared with a haplotype-agnostic algorithm.
现代泛基因组图是使用单倍型解析基因组组装构建的。在将读取映射到泛基因组图时,优先考虑与已知单倍型一致的对齐可以提高基因分型准确性。然而,现有的严格的共线性连锁和对齐问题的公式化并没有考虑泛基因组图中的单倍型路径。这通常会导致虚假的读取对齐到那些不太可能是已知单倍型重组的路径。在本文中,我们为序列到图的对齐和连锁问题开发了新的公式和算法。受基因型推断模型的启发,我们假设查询序列是参考单倍型的不完美镶嵌。因此,我们在每个单倍型开关的评分函数中引入了重组惩罚。首先,我们在 [公式:见文本] 时间内解决了单倍型感知的序列到图的对齐问题,其中 是查询序列, 是边的集合,H 是图中表示的单倍型集合。为了补充我们的解决方案,我们证明在强指数时间假设 (SETH) 下,比 [公式:见文本] 快的算法是不可能的。其次,我们提出了一种单倍型感知的连锁算法,在图预处理后运行时间为 [公式:见文本],其中 是输入锚点的计数。然后,我们证明在 SETH 下,比 [公式:见文本] 快的连锁算法是不可能的。作为概念验证,我们在 Minichain 对齐器中实现了我们的连锁算法。通过将从人类主要组织相容性复合体 (MHC) 中采样的序列与 60 个 MHC 单倍型的泛基因组图对齐,我们证明与单倍型无关的算法相比,我们的算法与地面真相重组具有更好的一致性。