Depuydt Lore, Renders Luca, de Vyver Simon Van, Veys Lennart, Gagie Travis, Fostier Jan
Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium.
Ghent University, Technologiepark 126, 9052 Ghent, Belgium.
bioRxiv. 2024 Jun 2:2024.05.30.596587. doi: 10.1101/2024.05.30.596587.
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
由于高质量基因组序列的可得性不断提高,在许多生物信息学流程中,泛基因组正逐渐取代单一的一致性参考基因组,以更好地捕捉遗传多样性。使用FM索引的传统生物信息学工具在处理如此庞大的基因组集合时面临内存限制。像Gagie等人的r索引和Nishimoto与Tabei的移动结构这样的游程长度压缩索引的最新进展,缓解了内存限制,但主要侧重于用于查找MEM的反向搜索。Arakawa等人的br索引在游程长度压缩空间中使用双向搜索启动完全近似模式匹配,但由于复杂的内存访问模式而存在显著的计算开销。我们引入了b移动,它是移动结构的一种新颖的双向扩展,能够在游程长度压缩空间中实现快速、缓存高效的双向字符扩展。它实现双向字符扩展的速度比br索引快8倍,缩小了与基于FM索引的替代方案之间的性能差距,同时保持了br索引良好的内存特性。例如,NCBI的RefSeq集合中所有可用的完整基因组都可以编译成一个适合典型笔记本电脑内存的b移动索引。因此,b移动在泛基因组索引和查询方面被证明是实用且可扩展的。我们提供了b移动的C++实现,支持包括定位功能在内的高效无损近似模式匹配,可在https://github.com/biointec/b-move上以AGPL-3.0许可获取。