Syllable-PBWT for space-efficient haplotype long-match query.

Affiliations

School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.

Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.

Publication information

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac734.

DOI: 10.1093/bioinformatics/btac734

PMID: 36440908

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9805553/

Abstract

MOTIVATION

The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query.
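The three ingredients named above (syllables, positional prefix arrays on the syllabic panel, and a polynomial rolling hash for positional substring comparison) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the syllable width b, the zero-padding of the last syllable, and the hash constants BASE and MOD are all hypothetical choices made for this example.

```python
from typing import List

MOD = (1 << 61) - 1   # hypothetical hash modulus (a Mersenne prime)
BASE = 1_000_003      # hypothetical hash base


def to_syllables(hap: str, b: int) -> List[int]:
    """Pack a binary haplotype string into integers of b sites each.

    Assumption: the final syllable is zero-padded on the right.
    """
    padded = hap + "0" * (-len(hap) % b)
    return [int(padded[i:i + b], 2) for i in range(0, len(padded), b)]


def prefix_arrays(panel: List[List[int]]) -> List[List[int]]:
    """Positional prefix arrays over a syllabic panel.

    arrays[k] lists haplotype indices ordered by their reversed prefix
    ending just before syllable k.
    """
    order = list(range(len(panel)))
    arrays = [order[:]]
    for k in range(len(panel[0])):
        # Python's sort is stable, so ties on syllable k keep the
        # previous (reversed-prefix) order, as in a radix-sort pass.
        order = sorted(order, key=lambda i: panel[i][k])
        arrays.append(order[:])
    return arrays


def prefix_hashes(syl: List[int]) -> List[int]:
    """Polynomial prefix hashes: h[j] covers syllables [0, j)."""
    h = [0]
    for s in syl:
        h.append((h[-1] * BASE + s + 1) % MOD)
    return h


def substring_hash(h: List[int], lo: int, hi: int) -> int:
    """Hash of syllables [lo, hi), derived from the prefix hashes."""
    return (h[hi] - h[lo] * pow(BASE, hi - lo, MOD)) % MOD


haps = ["01101100", "01101111", "11101100"]
panel = [to_syllables(x, 4) for x in haps]
arrays = prefix_arrays(panel)
hashes = [prefix_hashes(row) for row in panel]
# Haplotypes 0 and 2 agree on the second syllable but not on the first.
same_second = substring_hash(hashes[0], 1, 2) == substring_hash(hashes[2], 1, 2)
same_first = substring_hash(hashes[0], 0, 1) == substring_hash(hashes[2], 0, 1)
```

Equal hashes only indicate a probable match; a sketch like this would still need to account for collisions. The point is that a syllable range can be compared between two haplotypes in constant time once the prefix hashes are stored.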

RESULTS

Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions.

AVAILABILITY AND IMPLEMENTATION

https://github.com/ZhiGroup/Syllable-PBWT.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/29d2/9805553/ca000990691e/btac734f1.jpg

Similar articles

Syllable-PBWT for space-efficient haplotype long-match query.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac734.

Efficient haplotype matching between a query and a panel for genealogical search.

Bioinformatics. 2019 Jul 15;35(14):i233-i241. doi: 10.1093/bioinformatics/btz347.

Dynamic μ-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data.

bioRxiv. 2025 Feb 8:2025.02.04.636479. doi: 10.1101/2025.02.04.636479.

d-PBWT: dynamic positional Burrows-Wheeler transform.

Bioinformatics. 2021 Aug 25;37(16):2390-2397. doi: 10.1093/bioinformatics/btab117.

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad552.

P-smoother: efficient PBWT smoothing of large haplotype panels.

Bioinform Adv. 2022 Jun 20;2(1):vbac045. doi: 10.1093/bioadv/vbac045. eCollection 2022.

Haplotype Matching with GBWT for Pangenome Graphs.

bioRxiv. 2025 Feb 7:2025.02.03.634410. doi: 10.1101/2025.02.03.634410.

Haplotype-based Parallel PBWT for Biobank Scale Data.

bioRxiv. 2025 Feb 8:2025.02.04.636317. doi: 10.1101/2025.02.04.636317.

Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT).

Bioinformatics. 2014 May 1;30(9):1266-72. doi: 10.1093/bioinformatics/btu014. Epub 2014 Jan 9.

Exploiting parallelization in positional Burrows-Wheeler transform (PBWT) algorithms for efficient haplotype matching and compression.

Bioinform Adv. 2023 Mar 2;3(1):vbad021. doi: 10.1093/bioadv/vbad021. eCollection 2023.

Cited by

Dynamic μ-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data.

bioRxiv. 2025 Feb 8:2025.02.04.636479. doi: 10.1101/2025.02.04.636479.

Haplotype Matching with GBWT for Pangenome Graphs.

bioRxiv. 2025 Feb 7:2025.02.03.634410. doi: 10.1101/2025.02.03.634410.

vcfpp: a C++ API for rapid processing of the variant call format.

Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae049.

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad552.

Exploiting parallelization in positional Burrows-Wheeler transform (PBWT) algorithms for efficient haplotype matching and compression.

Bioinform Adv. 2023 Mar 2;3(1):vbad021. doi: 10.1093/bioadv/vbad021. eCollection 2023.

References

Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer.

Cell Syst. 2021 Oct 20;12(10):958-968.e6. doi: 10.1016/j.cels.2021.08.009. Epub 2021 Sep 14.

d-PBWT: dynamic positional Burrows-Wheeler transform.

Bioinformatics. 2021 Aug 25;37(16):2390-2397. doi: 10.1093/bioinformatics/btab117.

Fast and Robust Identity-by-Descent Inference with the Templated Positional Burrows-Wheeler Transform.

Mol Biol Evol. 2021 May 4;38(5):2131-2151. doi: 10.1093/molbev/msaa328.

Genotype imputation using the Positional Burrows Wheeler Transform.

PLoS Genet. 2020 Nov 16;16(11):e1009049. doi: 10.1371/journal.pgen.1009049. eCollection 2020 Nov.

A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data.

Am J Hum Genet. 2020 Apr 2;106(4):426-437. doi: 10.1016/j.ajhg.2020.02.010. Epub 2020 Mar 12.

Accurate, scalable and integrative haplotype estimation.

Nat Commun. 2019 Nov 28;10(1):5436. doi: 10.1038/s41467-019-13225-y.

Efficient haplotype matching between a query and a panel for genealogical search.

Bioinformatics. 2019 Jul 15;35(14):i233-i241. doi: 10.1093/bioinformatics/btz347.

Haplotype-aware graph indexes.

Bioinformatics. 2020 Jan 15;36(2):400-407. doi: 10.1093/bioinformatics/btz575.

RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts.

Genome Biol. 2019 Jul 25;20(1):143. doi: 10.1186/s13059-019-1754-8.

Multi-allelic positional Burrows-Wheeler transform.

BMC Bioinformatics. 2019 Jun 6;20(Suppl 11):279. doi: 10.1186/s12859-019-2821-6.
