Yang Zheng-Dao, Kuo Hsuan-Yu, Hsieh Po-Wei, Hung Jui-Hung
Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan.
Bioinformatics. 2024 Jun 19;40(7). doi: 10.1093/bioinformatics/btae409.
The Full-text index in Minute space (FM-index) is a memory-efficient data structure widely used in bioinformatics for solving the fundamental pattern-matching task of searching for short patterns within a long reference. With the demand for short query patterns, the k-ordered concept has been proposed for FM-indexes. However, few construction algorithms in the state of the art fully exploit this idea to achieve significant speedups in the pan-genome era.
We introduce the k-ordered Induced Suffix Sorting (kISS) for efficient construction and utilization of k-ordered FM-indexes. We present an algorithmic workflow for building k-ordered suffix arrays, incorporating two novel strategies to improve time and memory efficiency. We also demonstrate the compatibility of integrating k-ordered FM-indexes with locate operations in FMtree. Experiments show that kISS can improve the construction time, and the generated k-ordered suffix array can also be applied to FMtree without any additional in computation or memory usage.
https://github.com/jhhung/kISS.
Supplementary data are available at Bioinformatics online.
微空间全文索引(FM索引)是一种内存高效的数据结构,在生物信息学中广泛用于解决在长参考序列中搜索短模式的基本模式匹配任务。随着对短查询模式的需求,已针对FM索引提出了k阶概念。然而,目前的技术水平中很少有构建算法能充分利用这一思想在泛基因组时代实现显著加速。
我们引入了k阶诱导后缀排序(kISS),以高效构建和利用k阶FM索引。我们提出了一种构建k阶后缀数组的算法工作流程,纳入了两种新颖策略以提高时间和内存效率。我们还展示了将k阶FM索引与FM树中的定位操作集成的兼容性。实验表明,kISS可以缩短构建时间,并且生成的k阶后缀数组也可以应用于FM树,而无需任何额外的计算或内存使用。
https://github.com/jhhung/kISS。
补充数据可在《生物信息学》在线获取。