Sanaullah Ahsan, Zhi Degui, Zhang Shaojie
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.
Center for AI and Genome Informatics, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
bioRxiv. 2023 Jan 6:2023.01.04.522803. doi: 10.1101/2023.01.04.522803.
The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when sample size is large (tens of thousands to millions), because of its linear time complexity ( ( ), where is the number of haplotypes and is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for giving some optimal solution (Viterbi) to the LS HMM. But the solution space of the LS for large panels is still elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time constant to sample size ( ( )). This allows haplotype threading on very large biobank scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying a property that any MPSC will have a set of required regions, and then proposing a MPSC graph. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the Length Maximal MPSC, and -MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solved an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
李和斯蒂芬斯(LS)隐马尔可夫模型(HMM)将单倍型重建过程建模为参考面板中单倍型的镶嵌拷贝(单倍型穿线)。对于小面板,LS的概率参数化能够对这种镶嵌的不确定性进行建模,并且一直是单倍型定相和插补的基础模型。然而,当样本量很大(数万到数百万)时,LS会变得效率低下,因为其时间复杂度是线性的( ( ),其中 是单倍型的数量, 是面板中位点的数量)。最近,提出了一种有效的数据结构PBWT,它捕获单倍型之间的局部单倍型匹配,以提供快速方法来为LS HMM给出一些最优解(维特比)。但是,大面板的LS解空间仍然难以捉摸。之前我们引入了最小位置子串覆盖(MPSC)问题,作为LS的一种替代形式,其目标是用参考面板中单倍型的最少片段覆盖查询单倍型。MPSC公式允许在与样本量成常数时间内生成单倍型穿线( ( ))。这使得在非常大的生物样本库规模面板上进行单倍型穿线成为可能,而LS模型在这些面板上是不可行的。在这里,我们通过首先识别任何MPSC都将具有一组所需区域的属性,然后提出一个MPSC图,给出了关于MPSC解空间的新结果。此外,我们为MPSC推导了许多最优算法,包括解枚举、长度最大MPSC和 -MPSC解。通过这样做,我们的算法揭示了大面板的LS解空间。尽管我们只解决了LS的一种极端情况,即发射概率为0的情况,但我们的算法可以通过PBWT平滑变得更稳健。我们表明,我们的方法在揭示生物样本库规模数据集的特征方面具有信息价值,并且可以改进基因型插补。