Vinga Susana, Carvalho Alexandra M, Francisco Alexandre P, Russo Luís Ms, Almeida Jonas S
Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R, Alves Redol 9, 1000-029 Lisboa, Portugal.
Algorithms Mol Biol. 2012 May 2;7(1):10. doi: 10.1186/1748-7188-7-10.
Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
混沌游戏表示法(CGR)是一种迭代函数,它将离散序列双射地映射到连续域中。因此,离散序列可以成为统计和拓扑分析的对象,而这些分析原本是针对数值系统的。其特点是,共享长度为L的后缀的子串的CGR坐标将位于彼此2-L距离之内。自最初提出以来的二十年里,CGR已从最初专注于基因组序列得到了扩展,并已成功应用于生物信息学中的广泛问题。本报告探讨了它能否进一步扩展以应用于依赖离散的、基于图的表示法的算法。
此处描述的探索性分析包括选择基础字符串问题,并使用基于CGR的算法对其进行重构。我们发现CGR可以起到后缀树的作用,并模拟复杂的字符串算法,有效地解决精确和近似字符串匹配问题,如查找所有回文和串联重复序列,以及带错配的匹配。这些问题的共同特征是它们将最长公共扩展(LCE)查询用作其过程的子任务,我们证明使用CGR可以在恒定时间内解决这些查询。此外,我们表明CGR可以在拉宾 - 卡普算法中用作滚动哈希函数。
对生物序列的分析依赖于面临越来越多挑战的算法基础,这些挑战既有逻辑方面的(性能),也有分析方面的(缺乏统一的数学框架)。我们发现CGR提供了后者并有望解决前者:CGR映射产生的基于数值的数据结构带来了用于序列分析操作的基于图的数据结构,为各种模式匹配问题提供了统一的分析框架。