• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

用于序列集合的外部内存BWT和LCP计算及其应用

External memory BWT and LCP computation for sequence collections with applications.

作者信息

Egidi Lavinia, Louza Felipe A, Manzini Giovanni, Telles Guilherme P

机构信息

1DiSIT, University of Eastern Piedmont, Viale Michel, 11, 15121 Alessandria, Italy.

2Department of Computing and Mathematics, University of São Paulo, Av. Bandeirantes, 3900, 14040-901 Ribeirão Preto, Brazil.

出版信息

Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.

DOI:10.1186/s13015-019-0140-0
PMID:30899322
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC6408864/
Abstract

BACKGROUND

Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.

RESULTS

We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.

CONCLUSIONS

We prove that our algorithm performs sequential I/Os, where is the total length of the collection and is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

摘要

背景

测序技术产生的生物序列集合越来越大,必须存储在支持快速搜索操作的压缩索引中。许多压缩索引基于Burrows-Wheeler变换(BWT)和最长公共前缀(LCP)数组。由于输入数据量巨大,以最佳方式利用可用随机存取存储器(RAM)在外部存储器中高效地构建这些数据结构非常重要。

结果

我们提出了一种空间高效的算法,用于在外部或半外部存储器设置下计算序列集合的BWT和LCP数组。我们的算法将输入集合拆分为足够小的子集合,以便能够使用最优线性时间算法在RAM中计算它们的BWT。接下来,它在外部或半外部存储器中合并部分BWT,并在此过程中计算LCP值。我们的算法可以修改为输出另外两个数组,与BWT和LCP数组结合,为生物信息学中的三个著名问题提供简单的、基于扫描的外部存储器算法:最大重复序列的计算、所有后缀-前缀重叠对的计算以及简洁德布鲁因图的构建。

结论

我们证明我们的算法执行 次顺序输入/输出操作,其中 是集合的总长度, 是最大LCP值。实验结果表明,对于短序列,我们的算法仅比现有技术稍慢,但对于长序列或当可用RAM至少等于输入大小时,它快达40倍。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/e02c262a04c8/13015_2019_140_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/c9b6b7b79f17/13015_2019_140_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/c5cc9b3c1650/13015_2019_140_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/6fc85e336918/13015_2019_140_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/e02c262a04c8/13015_2019_140_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/c9b6b7b79f17/13015_2019_140_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/c5cc9b3c1650/13015_2019_140_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/6fc85e336918/13015_2019_140_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f83e/6408864/e02c262a04c8/13015_2019_140_Fig4_HTML.jpg

相似文献

1
External memory BWT and LCP computation for sequence collections with applications.用于序列集合的外部内存BWT和LCP计算及其应用
Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.
2
Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array.多线程多字符串Burrows-Wheeler变换与最长公共前缀数组
J Comput Biol. 2019 Sep;26(9):948-961. doi: 10.1089/cmb.2018.0230. Epub 2019 May 29.
3
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.gsufsort:为字符串集合构建后缀数组、最长公共前缀数组和Burrows-Wheeler变换
Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.
4
Generalized enhanced suffix array construction in external memory.外部内存中的广义增强后缀数组构造
Algorithms Mol Biol. 2017 Dec 7;12:26. doi: 10.1186/s13015-017-0117-9. eCollection 2017.
5
PFP Compressed Suffix Trees.PFP压缩后缀树
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
6
Computing the original eBWT faster, simpler, and with less memory.更快、更简单且占用更少内存地计算原始增强型Burrows-Wheeler变换。
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
7
deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding.deBWT:用于大量基因组集合的具有德布鲁因分支编码的Burrows-Wheeler变换的并行构建。
Bioinformatics. 2016 Jun 15;32(12):i174-i182. doi: 10.1093/bioinformatics/btw266.
8
Space-time Trade-offs for the LCP Array of Wheeler DFAs.惠勒确定有限自动机(DFA)的LCP数组的时空权衡
Int Symp String Process Inf Retr. 2023 Sep;14240:143-156. doi: 10.1007/978-3-031-43980-3_12. Epub 2023 Sep 20.
9
Variable-order reference-free variant discovery with the Burrows-Wheeler Transform.基于 Burrows-Wheeler 变换的变阶无参考变异发现。
BMC Bioinformatics. 2020 Sep 16;21(Suppl 8):260. doi: 10.1186/s12859-020-03586-3.
10
Prefix-free parsing for building big BWTs.用于构建大型Burrows-Wheeler变换(BWT)的无前缀解析
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.

引用本文的文献

1
A survey of BWT variants for string collections.针对字符串集合的BWT变体调查。
Bioinformatics. 2024 May 24;40(7). doi: 10.1093/bioinformatics/btae333.
2
Computing the original eBWT faster, simpler, and with less memory.更快、更简单且占用更少内存地计算原始增强型Burrows-Wheeler变换。
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
3
Fast, parallel, and cache-friendly suffix array construction.快速、并行且缓存友好的后缀数组构造。

本文引用的文献

1
MUMmer4: A fast and versatile genome alignment system.MUMmer4:一种快速且通用的基因组比对系统。
PLoS Comput Biol. 2018 Jan 26;14(1):e1005944. doi: 10.1371/journal.pcbi.1005944. eCollection 2018 Jan.
2
Generalized enhanced suffix array construction in external memory.外部内存中的广义增强后缀数组构造
Algorithms Mol Biol. 2017 Dec 7;12:26. doi: 10.1186/s13015-017-0117-9. eCollection 2017.
3
Succinct colored de Bruijn graphs.简明彩色 de Bruijn 图。
Algorithms Mol Biol. 2024 Apr 28;19(1):16. doi: 10.1186/s13015-024-00263-5.
4
phyBWT2: phylogeny reconstruction via eBWT positional clustering.phyBWT2:通过增强型Burrows-Wheeler变换位置聚类进行系统发育重建
Algorithms Mol Biol. 2023 Aug 3;18(1):11. doi: 10.1186/s13015-023-00232-4.
5
Metagenomic analysis through the extended Burrows-Wheeler transform.基于扩展的 Burrows-Wheeler 变换的宏基因组分析。
BMC Bioinformatics. 2020 Sep 16;21(Suppl 8):299. doi: 10.1186/s12859-020-03628-w.
6
Variable-order reference-free variant discovery with the Burrows-Wheeler Transform.基于 Burrows-Wheeler 变换的变阶无参考变异发现。
BMC Bioinformatics. 2020 Sep 16;21(Suppl 8):260. doi: 10.1186/s12859-020-03586-3.
Bioinformatics. 2017 Oct 15;33(20):3181-3187. doi: 10.1093/bioinformatics/btx067.
4
Merging of multi-string BWTs with applications.多字符串 BWT 的合并及其应用。
Bioinformatics. 2014 Dec 15;30(24):3524-31. doi: 10.1093/bioinformatics/btu584. Epub 2014 Aug 28.
5
Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree.利用布罗沃德-惠勒变换和小波树进行高效最大重复查找。
IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):421-9. doi: 10.1109/TCBB.2011.127. Epub 2011 Sep 27.