Suppr超能文献

基于音节的 PBWT 用于高效空间的单倍型长匹配查询。

Syllable-PBWT for space-efficient haplotype long-match query.

机构信息

School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.

Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.

出版信息

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac734.

Abstract

MOTIVATION

The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query.

RESULTS

Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions.

AVAILABILITY AND IMPLEMENTATION

https://github.com/ZhiGroup/Syllable-PBWT.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

摘要

动机

位置 Burrows-Wheeler 变换(PBWT)在生物库规模数据的单倍型匹配方面取得了巨大进展。对于遗传谱系搜索,基于 PBWT 的方法优化了在查询单倍型和预定义的单倍型面板之间找到长匹配的渐近运行时间。然而,为了实现快速查询搜索,必须将完整大小的面板和 PBWT 数据结构保留在内存中,这使得现有的算法无法扩展到由数百万个单倍型组成的现代生物库面板。在这项工作中,我们提出了一种名为音节 PBWT 的 PBWT 的空间高效变体,它将每个单倍型划分为音节,在压缩的音节面板上构建 PBWT 位置前缀数组,并利用多项式滚动哈希函数进行位置子串比较。然后,我们使用音节 PBWT 数据结构提出了一种名为音节查询的长匹配查询算法。

结果

与长匹配查询问题最节省时间和空间的先前发布的解决方案相比,音节查询在英国生物库基因型数据和 1000 基因组计划序列数据上分别将内存使用减少了 100 多倍。令人惊讶的是,我们的音节数据结构更小,允许更有效的迭代和 CPU 缓存使用,使得音节查询甚至比现有解决方案更快的运行时间。

可用性和实现

https://github.com/ZhiGroup/Syllable-PBWT。

补充信息

补充数据可在生物信息学在线获得。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/29d2/9805553/ca000990691e/btac734f1.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验