School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.
Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac734.
The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query.
Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions.
https://github.com/ZhiGroup/Syllable-PBWT.
Supplementary data are available at Bioinformatics online.
位置 Burrows-Wheeler 变换(PBWT)在生物库规模数据的单倍型匹配方面取得了巨大进展。对于遗传谱系搜索,基于 PBWT 的方法优化了在查询单倍型和预定义的单倍型面板之间找到长匹配的渐近运行时间。然而,为了实现快速查询搜索,必须将完整大小的面板和 PBWT 数据结构保留在内存中,这使得现有的算法无法扩展到由数百万个单倍型组成的现代生物库面板。在这项工作中,我们提出了一种名为音节 PBWT 的 PBWT 的空间高效变体,它将每个单倍型划分为音节,在压缩的音节面板上构建 PBWT 位置前缀数组,并利用多项式滚动哈希函数进行位置子串比较。然后,我们使用音节 PBWT 数据结构提出了一种名为音节查询的长匹配查询算法。
与长匹配查询问题最节省时间和空间的先前发布的解决方案相比,音节查询在英国生物库基因型数据和 1000 基因组计划序列数据上分别将内存使用减少了 100 多倍。令人惊讶的是,我们的音节数据结构更小,允许更有效的迭代和 CPU 缓存使用,使得音节查询甚至比现有解决方案更快的运行时间。
https://github.com/ZhiGroup/Syllable-PBWT。
补充数据可在生物信息学在线获得。