b-move: faster bidirectional character extensions in a run-length compressed index.

Authors

Depuydt Lore, Renders Luca, Van de Vyver Simon, Veys Lennart, Gagie Travis, Fostier Jan

Affiliations

Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium.

Ghent University, Technologiepark 126, 9052 Ghent, Belgium.

Publication information

bioRxiv. 2024 Jun 2:2024.05.30.596587. doi: 10.1101/2024.05.30.596587.

DOI:10.1101/2024.05.30.596587
PMID:38854079
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11160816/
Abstract

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advances in run-length compressed indices, such as Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure that enables fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives while maintaining the br-index's favorable memory characteristics. For example, all available complete genomes in NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
