• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

b移动:游程长度压缩索引中的更快无损近似模式匹配

b-move: Faster Lossless Approximate Pattern Matching in a Run-Length Compressed Index.

作者信息

Depuydt Lore, Renders Luca, Van de Vyver Simon, Veys Lennart, Gagie Travis, Fostier Jan

机构信息

Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium.

Ghent University, Technologiepark 126, 9052 Ghent, Belgium.

出版信息

Res Sq. 2024 Nov 18:rs.3.rs-5367343. doi: 10.21203/rs.3.rs-5367343/v1.

DOI:10.21203/rs.3.rs-5367343/v1
PMID:39606487
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC11601852/
Abstract

BACKGROUND

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.

RESULTS

We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs and operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.

CONCLUSIONS

b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.

摘要

背景

由于高质量基因组序列的可得性不断提高,在许多生物信息学流程中,泛基因组正逐渐取代单一的共有参考基因组,以更好地捕捉遗传多样性。使用FM索引的传统生物信息学工具在处理如此庞大的基因组集合时面临内存限制。像Gagie等人的r索引和Nishimoto与Tabei的移动结构等游程长度压缩索引的最新进展,缓解了内存限制,但主要侧重于向后搜索以查找MEM。Arakawa等人的br索引在游程长度压缩空间中使用双向搜索启动完全近似模式匹配,但由于复杂的内存访问模式而具有显著的计算开销。

结果

我们引入了b移动,这是移动结构的一种新颖的双向扩展,能够在游程长度压缩空间中实现快速、缓存高效、无损的近似模式匹配。它实现双向字符扩展的速度比br索引快7倍,缩小了与基于FM索引的替代方法之间的性能差距。对于定位出现位置,b移动执行 和 操作的速度比br索引快7倍。同时,它保持了br索引良好的内存特性,例如,NCBI的RefSeq集合中所有可用的完整基因组都可以编译成一个适合典型笔记本电脑内存的b移动索引。

结论

b移动在泛基因组索引和查询方面被证明是实用且可扩展的。我们提供了b移动的C++实现,支持高效的无损近似模式匹配,包括定位功能,可在https://github.com/biointec/b-move上以AGPL-3.0许可获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/554a424c8195/nihpp-rs5367343v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/6a862c4e9de8/nihpp-rs5367343v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/cfc714ad31f5/nihpp-rs5367343v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/c0493e13b543/nihpp-rs5367343v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/a167aee85386/nihpp-rs5367343v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/64af6a58cf99/nihpp-rs5367343v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/554a424c8195/nihpp-rs5367343v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/6a862c4e9de8/nihpp-rs5367343v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/cfc714ad31f5/nihpp-rs5367343v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/c0493e13b543/nihpp-rs5367343v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/a167aee85386/nihpp-rs5367343v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/64af6a58cf99/nihpp-rs5367343v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33c5/11601852/554a424c8195/nihpp-rs5367343v1-f0006.jpg

相似文献

1
b-move: Faster Lossless Approximate Pattern Matching in a Run-Length Compressed Index.b移动:游程长度压缩索引中的更快无损近似模式匹配
Res Sq. 2024 Nov 18:rs.3.rs-5367343. doi: 10.21203/rs.3.rs-5367343/v1.
2
b-move: faster lossless approximate pattern matching in a run-length compressed index.b移动:在游程长度压缩索引中实现更快的无损近似模式匹配。
Algorithms Mol Biol. 2025 Aug 12;20(1):15. doi: 10.1186/s13015-025-00281-x.
3
b-move: faster bidirectional character extensions in a run-length compressed index.b移动:游程长度压缩索引中更快的双向字符扩展
bioRxiv. 2024 Jun 2:2024.05.30.596587. doi: 10.1101/2024.05.30.596587.
4
Run-length compressed metagenomic read classification with SMEM-finding and tagging.基于SMEM查找和标记的游程长度压缩宏基因组读取分类
bioRxiv. 2025 Mar 24:2025.02.25.640119. doi: 10.1101/2025.02.25.640119.
5
Prescription of Controlled Substances: Benefits and Risks管制药品的处方:益处与风险
6
Aspects of Genetic Diversity, Host Specificity and Public Health Significance of Single-Celled Intestinal Parasites Commonly Observed in Humans and Mostly Referred to as 'Non-Pathogenic'.人类常见且大多被称为“非致病性”的单细胞肠道寄生虫的遗传多样性、宿主特异性及公共卫生意义
APMIS. 2025 Sep;133(9):e70036. doi: 10.1111/apm.70036.
7
Electrophoresis电泳
8
Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index.Effloc:一种基于FM索引的大规模出现生物模式的高效定位算法。
J Comput Biol. 2025 Sep;32(9):865-878. doi: 10.1089/cmb.2024.0925. Epub 2025 May 2.
9
Movi Color: fast and accurate long-read classification with the move structure.Movi Color:利用移动结构进行快速准确的长读长分类。
bioRxiv. 2025 May 27:2025.05.22.655637. doi: 10.1101/2025.05.22.655637.
10
Lossless Pangenome Indexing Using Tag Arrays.使用标签数组的无损全基因组索引
bioRxiv. 2025 May 15:2025.05.12.653561. doi: 10.1101/2025.05.12.653561.

本文引用的文献

1
Movi: A fast and cache-efficient full-text pangenome index.Movi:一种快速且缓存高效的全基因组索引。
iScience. 2024 Nov 27;27(12):111464. doi: 10.1016/j.isci.2024.111464. eCollection 2024 Dec 20.
2
Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes.无损近似模式匹配:高效搜索方案的自动化设计。
J Comput Biol. 2024 Oct;31(10):975-989. doi: 10.1089/cmb.2024.0664. Epub 2024 Sep 30.
3
Faster Maximal Exact Matches with Lazy LCP Evaluation.通过延迟最长公共前缀(LCP)评估实现更快的最大精确匹配
Proc Data Compress Conf. 2024 Mar;2024:123-132. doi: 10.1109/dcc58796.2024.00020. Epub 2024 May 21.
4
Pan-genome de Bruijn graph using the bidirectional FM-index.基于双向 FM-index 的泛基因组 de Bruijn 图
BMC Bioinformatics. 2023 Oct 26;24(1):400. doi: 10.1186/s12859-023-05531-6.
5
SPUMONI 2: improved classification using a pangenome index of minimizer digests.SPUMONI 2:使用最小化消化物的泛基因组指数进行改进分类。
Genome Biol. 2023 May 18;24(1):122. doi: 10.1186/s13059-023-02958-1.
6
MONI: A Pangenomic Index for Finding Maximal Exact Matches.MONI:用于寻找最大精确匹配的泛基因组索引。
J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.
7
PHONI: Streamed Matching Statistics with Multi-Genome References.PHONI:多基因组参考的流式匹配统计
Proc Data Compress Conf. 2021 Mar;2021:193-202. doi: 10.1109/dcc50243.2021.00027. Epub 2021 May 10.
8
Dynamic partitioning of search patterns for approximate pattern matching using search schemes.使用搜索方案对近似模式匹配的搜索模式进行动态分区。
iScience. 2021 Jun 10;24(7):102687. doi: 10.1016/j.isci.2021.102687. eCollection 2021 Jul 23.
9
Pan-genomic matching statistics for targeted nanopore sequencing.靶向纳米孔测序的泛基因组匹配统计
iScience. 2021 Jun 8;24(6):102696. doi: 10.1016/j.isci.2021.102696. eCollection 2021 Jun 25.
10
Computational pan-genomics: status, promises and challenges.计算泛基因组学:现状、前景与挑战。
Brief Bioinform. 2018 Jan 1;19(1):118-135. doi: 10.1093/bib/bbw089.