Suppr超能文献

通过混沌游戏表示法进行模式匹配:为生物序列分析搭建数字与离散数据结构之间的桥梁。

Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis.

作者信息

Vinga Susana, Carvalho Alexandra M, Francisco Alexandre P, Russo Luís Ms, Almeida Jonas S

机构信息

Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R, Alves Redol 9, 1000-029 Lisboa, Portugal.

出版信息

Algorithms Mol Biol. 2012 May 2;7(1):10. doi: 10.1186/1748-7188-7-10.

Abstract

BACKGROUND

Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.

RESULTS

The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

CONCLUSIONS

The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.

摘要

背景

混沌游戏表示法(CGR)是一种迭代函数,它将离散序列双射地映射到连续域中。因此,离散序列可以成为统计和拓扑分析的对象,而这些分析原本是针对数值系统的。其特点是,共享长度为L的后缀的子串的CGR坐标将位于彼此2-L距离之内。自最初提出以来的二十年里,CGR已从最初专注于基因组序列得到了扩展,并已成功应用于生物信息学中的广泛问题。本报告探讨了它能否进一步扩展以应用于依赖离散的、基于图的表示法的算法。

结果

此处描述的探索性分析包括选择基础字符串问题,并使用基于CGR的算法对其进行重构。我们发现CGR可以起到后缀树的作用,并模拟复杂的字符串算法,有效地解决精确和近似字符串匹配问题,如查找所有回文和串联重复序列,以及带错配的匹配。这些问题的共同特征是它们将最长公共扩展(LCE)查询用作其过程的子任务,我们证明使用CGR可以在恒定时间内解决这些查询。此外,我们表明CGR可以在拉宾 - 卡普算法中用作滚动哈希函数。

结论

对生物序列的分析依赖于面临越来越多挑战的算法基础,这些挑战既有逻辑方面的(性能),也有分析方面的(缺乏统一的数学框架)。我们发现CGR提供了后者并有望解决前者:CGR映射产生的基于数值的数据结构带来了用于序列分析操作的基于图的数据结构,为各种模式匹配问题提供了统一的分析框架。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f369/3402988/0f2f980d0634/1748-7188-7-10-1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验