Guo Li-Lu
Department of Computer Science, Xidian University, Xi'an, Shaanxi, China.
J Comput Biol. 2025 Sep;32(9):865-878. doi: 10.1089/cmb.2024.0925. Epub 2025 May 2.
Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows-Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column () mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm's effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc.
模式定位是各种生物序列分析任务中的关键步骤。作为一种压缩全文索引技术,全文微空间索引已被引入用于超长基因组的生物模式定位,具有低内存占用且检索时间与基因组大小无关的特点。然而,其定位时间受基因组中生物模式出现次数的限制,在处理大量出现的生物模式时效率不够高。为了解决这个问题,我们提出了一种用于基因组序列中大量出现的生物模式的高效定位算法,即Effloc。它基于两种优化技术开发。一种是将具有相同Burrows-Wheeler变换字符的排名组织成一组并一起计算,从而减少向前跳跃以找到后缀数组(SA)采样点所需的从后到前列()映射操作的数量;另一种是设计一种特定结构来记录跳跃状态,从而避免在为那些共享相同采样点的相邻模式寻找SA采样点的过程中存在的冗余映射操作。与现有算法相比,Effloc可以显著减少大量出现模式定位中耗时的映射操作数量。消融实验验证了我们算法的有效性,与五种最先进的竞争算法相比,其定位速度更快。源代码和数据可在https://github.com/Lilu-guo/Effloc上获取。