Suppr超能文献

动态 -PBWT:用于生物样本库规模数据的动态游程长度编码PBWT

Dynamic -PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data.

作者信息

Shakya Pramesh, Sanaullah Ahsan, Zhi Degui, Zhang Shaojie

机构信息

Department of Computer Science, University of Central Florida , Orlando, FL, USA.

McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.

出版信息

bioRxiv. 2025 Feb 8:2025.02.04.636479. doi: 10.1101/2025.02.04.636479.

Abstract

Durbin's positional Burrows-Wheeler transform (PBWT) supports efficient haplotype matching and queries given a panel of haplotypes. It has been widely used for statistical phasing, imputation and identity-by-descent (IBD) detection. However, the original PBWT panel doesn't support dynamic updates when haplotypes need to be added or deleted from the panel. Dynamic-PBWT (d-PBWT) solved this problem but it is not memory efficient. While the memory constraint problem of the PBWT has been tackled by Syllable-PBWT and -PBWT, these are static data structures that do not allow updates. Additionally, Syllable-PBWT only supports long-match query and -PBWT only supports set-maximal match query, limiting their functionality in the compressed form. In this paper, we present Dynamic -PBWT (which can also be seen as compressed d-PBWT) that is memory efficient and supports dynamic updates. We run-length compress PBWT to achieve better compression rate and store the runs in the self-balancing trees to enable dynamic updates. We show that the number of updates per insertion or deletion in the tree at each site is constant regardless of the number of haplotypes in the panel and the updates can be made without decompressing the index. In addition, we use orders of magnitude less memory than d-PBWT. We also provide a long match query algorithm that can easily be extended back to the original -PBWT. Overall, the flexibility and space-efficiency of Dynamic -PBWT makes it a potential index data structure for biobank scale genetic data analyses. The source code for Dynamic -PBWT is available at https://github.com/ucfcbb/Dynamic-mu-PBWT.

摘要

德宾的位置布罗-惠勒变换(PBWT)在给定一组单倍型的情况下,支持高效的单倍型匹配和查询。它已被广泛用于统计定相、插补和同源性检测。然而,原始的PBWT面板在需要从面板中添加或删除单倍型时不支持动态更新。动态PBWT(d-PBWT)解决了这个问题,但它在内存使用上效率不高。虽然音节PBWT和μ-PBWT解决了PBWT的内存限制问题,但它们是不允许更新的静态数据结构。此外,音节PBWT仅支持长匹配查询,μ-PBWT仅支持集合最大匹配查询,这限制了它们在压缩形式下的功能。在本文中,我们提出了动态μ-PBWT(也可视为压缩的d-PBWT),它既节省内存又支持动态更新。我们对PBWT进行游程编码压缩以获得更好的压缩率,并将游程存储在自平衡树中以实现动态更新。我们表明,在每个位点,无论面板中单倍型的数量如何,树中每次插入或删除时的更新数量都是恒定的,并且可以在不解压缩索引的情况下进行更新。此外,我们使用的内存比d-PBWT少几个数量级。我们还提供了一种长匹配查询算法,该算法可以轻松扩展回原始的μ-PBWT。总体而言,动态μ-PBWT的灵活性和空间效率使其成为生物样本库规模遗传数据分析的潜在索引数据结构。动态μ-PBWT的源代码可在https://github.com/ucfcbb/Dynamic-mu-PBWT获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d3e9/11838588/fce22ccc3326/nihpp-2025.02.04.636479v1-f0012.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验