School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China.
Key Laboratory on High Performance Computing, Anhui Province, P. R. China.
J Bioinform Comput Biol. 2024 Aug;22(4):2450019. doi: 10.1142/S0219720024500197. Epub 2024 Aug 31.
A sequence graph represents the genetic variation of a pan-genome more concisely and space-efficiently than multiple linear reference genomes. To accelerate the alignment of reads to the graph, an index of the graph-based reference genome is used to obtain candidate locations. However, the potential combinatorial explosion of nodes in the sequence graph considerably increases the index size and the peak memory usage of the alignment process, especially for large-scale datasets. To address this, existing methods typically prune complex regions or extend the seed length, which sacrifices the recall of the alignment algorithm while reducing space usage only slightly. We present the index SIG and the alignment algorithm SIG-Aligner, which index and align at a lower memory cost. SIG builds a non-overlapping minimizer index inside the nodes of the sequence graph, and SIG-Aligner filters out most false-positive matches with a method based on the pigeonhole principle. In computational experiments on human pan-genome graphs, SIG reduces index memory by 50% to 75% compared with Giraffe, while preserving superior or comparable alignment accuracy and achieving faster alignment.
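The two ideas named in the abstract, non-overlapping minimizers inside nodes and pigeonhole-based filtering, can be illustrated with a minimal Python sketch. It is not the SIG or SIG-Aligner implementation. Everything in it is an assumption made for illustration: the function names, the window length w, the lexicographic ordering of k-mers, the restriction to reads that fit inside a single node, and the restriction to substitution errors. Each node is cut into disjoint windows of length w and one k-mer is kept per window, so seeds never overlap and never cross a node boundary. Because the seeds are disjoint, one substitution can destroy at most one of them, so a placement with more than max_errors missing seeds cannot be a true match.

```python
from collections import defaultdict


def node_minimizers(seq, k, w):
    """One minimizer per disjoint length-w window of a single node sequence."""
    picks = []
    for start in range(0, len(seq) - w + 1, w):
        # Leftmost lexicographically smallest k-mer lying fully inside the window.
        best = min(range(start, start + w - k + 1), key=lambda i: seq[i:i + k])
        picks.append((seq[best:best + k], best))
    return picks


def build_index(nodes, k, w):
    """Map k-mer -> [(node_id, offset)]; seeds never cross a node boundary."""
    index = defaultdict(list)
    for node_id, seq in nodes.items():
        for kmer, pos in node_minimizers(seq, k, w):
            index[kmer].append((node_id, pos))
    return index


def candidates(read, nodes, index, k, w, max_errors):
    """Return (node_id, start) placements that survive the pigeonhole check."""
    # Count indexed seeds that match the read exactly, grouped by placement.
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        for node_id, pos in index.get(read[i:i + k], ()):
            hits[(node_id, pos - i)] += 1

    kept = []
    for (node_id, start), matched in hits.items():
        end = start + len(read)
        if start < 0 or end > len(nodes[node_id]):
            continue  # assumption: the sketch only handles reads inside one node
        # Indexed seeds that the read fully covers at this placement.
        covered = sum(1 for _, p in node_minimizers(nodes[node_id], k, w)
                      if start <= p and p + k <= end)
        # Pigeonhole: disjoint seeds, so each substitution breaks at most one.
        if covered - matched <= max_errors:
            kept.append((node_id, start))
    return sorted(kept)
```

A typical call builds the index once with build_index(nodes, k, w) and then runs candidates(read, nodes, index, k, w, max_errors) for each read. A real graph aligner would additionally follow edges between nodes and account for insertions and deletions, which this sketch deliberately leaves out.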