Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model.

Authors

Sanaullah Ahsan, Zhi Degui, Zhang Shaojie

Affiliations

Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.

Center for AI and Genome Informatics, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.

Publication information

bioRxiv. 2023 Jan 6:2023.01.04.522803. doi: 10.1101/2023.01.04.522803.

DOI:10.1101/2023.01.04.522803

PMID:36711469

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9881975/

Abstract

The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and LS has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when the sample size is large (tens of thousands to millions) because of its linear time complexity ($O(MN)$, where $M$ is the number of haplotypes and $N$ is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for giving some optimal solution (Viterbi) to the LS HMM. But the solution space of LS for large panels is still elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time constant to sample size ($O(N)$). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying a property that any MPSC will have a set of required regions, and then proposing an MPSC graph. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the Length Maximal MPSC, and $h$-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solved an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
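
To make the MPSC objective concrete, the sketch below covers a query haplotype with as few panel segments as possible by repeatedly taking the longest match that starts at the current site. It is a naive illustration only, not the authors' PBWT-based algorithm: it scans every panel haplotype, so it does not achieve the sample-size-independent running time described in the abstract. The names `greedy_mpsc` and `longest_match_from`, and the string encoding of haplotypes, are assumptions made for this example. Mismatches are not tolerated, which mirrors the zero-emission-probability case mentioned above.

```python
from typing import List, Optional, Sequence, Tuple

# A segment is (start_site, end_site_exclusive, panel_row). Hypothetical layout.
Segment = Tuple[int, int, int]


def longest_match_from(panel: Sequence[str], query: str, start: int) -> Tuple[int, int]:
    """Return (end, row) of the longest exact match to the query beginning at
    site `start`, over all panel haplotypes. row is -1 if nothing matches."""
    best_end, best_row = start, -1
    for row, hap in enumerate(panel):
        end = start
        # Extend the match site by site at the SAME positions (positional match).
        while end < len(query) and hap[end] == query[end]:
            end += 1
        if end > best_end:
            best_end, best_row = end, row
    return best_end, best_row


def greedy_mpsc(panel: Sequence[str], query: str) -> Optional[List[Segment]]:
    """Naive greedy cover: always take the longest match starting at the
    current site. Because any sub-interval of a match is also a match, this
    leftmost-longest choice uses the minimum number of segments.
    Returns None when some site of the query matches no panel haplotype."""
    n = len(query)
    if any(len(hap) != n for hap in panel):
        raise ValueError("all panel haplotypes must have the query's length")
    cover: List[Segment] = []
    pos = 0
    while pos < n:
        end, row = longest_match_from(panel, query, pos)
        if row < 0:
            return None  # no exact cover exists at this site
        cover.append((pos, end, row))
        pos = end
    return cover


if __name__ == "__main__":
    panel = ["00110", "11001", "10101"]
    print(greedy_mpsc(panel, "00101"))
```

In this toy panel the query `00101` is covered by two segments: sites 0 to 2 from the first haplotype and sites 3 to 4 from the second. Other minimal covers may exist (the third haplotype also matches sites 3 to 4), which is the kind of ambiguity the solution-space results in the abstract address.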


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2256/9881975/06a591c27d85/nihpp-2023.01.04.522803v1-f0002.jpg
