Depuydt Lore, Renders Luca, Van de Vyver Simon, Veys Lennart, Gagie Travis, Fostier Jan
Ghent University-imec, Technologiepark 126, 9052, Ghent, Belgium.
Ghent University, Technologiepark 126, 9052, Ghent, Belgium.
Algorithms Mol Biol. 2025 Aug 12;20(1):15. doi: 10.1186/s13015-025-00281-x.
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.
We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs and operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.
b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
由于高质量基因组序列的可用性不断提高,在许多生物信息学流程中,泛基因组正逐渐取代单一的共识参考基因组,以更好地捕捉遗传多样性。使用FM索引的传统生物信息学工具在处理如此庞大的基因组集合时面临内存限制。像Gagie等人的r索引和Nishimoto与Tabei的移动结构等游程长度压缩索引的最新进展,缓解了内存限制,但主要侧重于向后搜索以寻找MEM。Arakawa等人的br索引在游程长度压缩空间中使用双向搜索启动完全近似模式匹配,但由于复杂的内存访问模式而具有显著的计算开销。
我们引入了b移动,它是移动结构的一种新颖的双向扩展,能够在游程长度压缩空间中实现快速、缓存高效、无损的近似模式匹配。它实现双向字符扩展的速度比br索引快7倍,缩小了与基于FM索引的替代方法之间的性能差距。对于定位出现位置,b移动执行 和 操作的速度比br索引快7倍。同时,它保持了br索引良好的内存特性,例如,NCBI的RefSeq集合中所有可用的完整大肠杆菌基因组都可以编译成一个适合典型笔记本电脑内存的b移动索引。
b移动在泛基因组索引和查询方面被证明是实用且可扩展的。我们提供了b移动的C++实现,支持高效的无损近似模式匹配,包括定位功能,可在https://github.com/biointec/b-move上以AGPL-3.0许可获取。