Öztürk Ünsal, Mattavelli Marco, Ribeca Paolo
SCI-STI-MM, EPFL, ELB 118, Station 11, 1015, Lausanne, Switzerland.
Biomathematics and Statistics Scotland, The James Hutton Institute, Peter Guthrie Tait Road, EH9 3FD, Edinburgh, United Kingdom.
NAR Genom Bioinform. 2024 Dec 11;6(4):lqae159. doi: 10.1093/nargab/lqae159. eCollection 2024 Dec.
This paper presents a new data structure, GIN-TONIC (raph dexing hrough ptimal ear nterval ompaction), designed to index arbitrary string-labelled directed graphs representing, for instance, pangenomes or transcriptomes. GIN-TONIC provides several capabilities not offered by other graph-indexing methods based on the FM-Index. It is non-hierarchical, handling a graph as a monolithic object; it indexes at nucleotide resolution all possible walks in the graph without the need to explicitly store them; it supports exact substring queries in polynomial time and space for all possible walk roots in the graph, even if there are exponentially many walks corresponding to such roots. Specific ad-hoc optimizations, such as precomputed caches, allow GIN-TONIC to achieve excellent performance for input graphs of various topologies and sizes. Robust scalability capabilities and a querying performance close to that of a linear FM-Index are demonstrated for two real-world applications on the scale of human pangenomes and transcriptomes. Source code and associated benchmarks are available on GitHub.
本文提出了一种新的数据结构GIN-TONIC(通过最优耳区间压缩进行图索引),旨在对任意字符串标记的有向图进行索引,例如代表泛基因组或转录组的图。GIN-TONIC提供了基于FM索引的其他图索引方法所没有的几种功能。它是非分层的,将图作为一个整体对象来处理;它以核苷酸分辨率对图中所有可能的路径进行索引,而无需显式存储它们;它支持在多项式时间和空间内对图中所有可能的路径根进行精确子串查询,即使对应于这些根的路径数量呈指数级增长。特定的临时优化,如预计算缓存,使GIN-TONIC能够在各种拓扑结构和大小的输入图上实现出色的性能。针对人类泛基因组和转录组规模的两个实际应用,展示了强大的可扩展性能力以及接近线性FM索引的查询性能。源代码和相关基准测试可在GitHub上获取。