

Data Structures for SMEM-Finding in the PBWT.

Author information

Bonizzoni Paola, Boucher Christina, Cozzi Davide, Gagie Travis, Köppl Dominik, Rossi Massimiliano

Affiliations

University of Milano-Bicocca, Milano, Italy.

University of Florida, Gainesville, FL.

Publication information

Int Symp String Process Inf Retr. 2023 Sep;14240:89-101. doi: 10.1007/978-3-031-43980-3_8. Epub 2023 Sep 20.

Abstract

The positional Burrows-Wheeler Transform (PBWT) was presented as a means to find set-maximal exact matches (SMEMs) in haplotype data via the computation of the divergence array. Although run-length encoding the PBWT has been previously considered, storing the divergence array along with the PBWT in a compressed manner has not been as rigorously studied. We define two queries that can be used in combination to compute SMEMs, allowing us to define smaller data structures that support one or both of these queries. We combine these data structures, enabling the PBWT and the divergence array to be stored in a manner that allows for finding SMEMs. We estimate and compare the memory usage of these data structures, leading to one data structure that is most memory efficient. Lastly, we implement this data structure and compare its performance to prior methods using various datasets taken from the 1000 Genomes Project data.
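For readers unfamiliar with the two arrays the abstract refers to, the following is a minimal sketch of how the PBWT prefix array and the divergence array can be computed column by column, assuming Durbin's original uncompressed formulation. The function name `pbwt_arrays` and the list-of-lists input are illustrative assumptions; this is not the compressed data structure the paper proposes, and the paper may use a different convention for the divergence array.

```python
def pbwt_arrays(haplotypes):
    """Sketch of the prefix array `a` and divergence array `d`
    after all columns are processed (assumed convention: d[i] is the
    start column of the longest match, ending at the last column,
    between haplotypes a[i] and a[i-1]; a value equal to the number
    of columns means "no match")."""
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    a = list(range(m))   # initial order: input order
    d = [0] * m          # empty prefixes trivially match from column 0
    for k in range(n):
        a0, a1, d0, d1 = [], [], [], []
        # p / q track the latest match start seen since the last
        # haplotype carrying allele 0 / allele 1 in column k.
        p = q = k + 1
        for idx, start in zip(a, d):
            p = max(p, start)
            q = max(q, start)
            if haplotypes[idx][k] == 0:
                a0.append(idx)
                d0.append(p)
                p = 0
            else:
                a1.append(idx)
                d1.append(q)
                q = 0
        # Stable partition: allele-0 haplotypes first, then allele-1.
        a, d = a0 + a1, d0 + d1
    return a, d
```

For the four hypothetical haplotypes `[[0,1,0],[1,1,0],[0,1,1],[1,0,0]]`, this returns `a = [3, 0, 1, 2]` and `d = [3, 2, 1, 3]`: adjacent haplotypes in the sorted order share the longest suffix matches, which is the property SMEM-finding relies on.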



