• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

通过混沌游戏表示法进行模式匹配:为生物序列分析搭建数字与离散数据结构之间的桥梁。

Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis.

作者信息

Vinga Susana, Carvalho Alexandra M, Francisco Alexandre P, Russo Luís Ms, Almeida Jonas S

机构信息

Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R, Alves Redol 9, 1000-029 Lisboa, Portugal.

出版信息

Algorithms Mol Biol. 2012 May 2;7(1):10. doi: 10.1186/1748-7188-7-10.

DOI:10.1186/1748-7188-7-10
PMID:22551152
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC3402988/
Abstract

BACKGROUND

Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.

RESULTS

The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

CONCLUSIONS

The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.

摘要

背景

混沌游戏表示法(CGR)是一种迭代函数,它将离散序列双射地映射到连续域中。因此,离散序列可以成为统计和拓扑分析的对象,而这些分析原本是针对数值系统的。其特点是,共享长度为L的后缀的子串的CGR坐标将位于彼此2-L距离之内。自最初提出以来的二十年里,CGR已从最初专注于基因组序列得到了扩展,并已成功应用于生物信息学中的广泛问题。本报告探讨了它能否进一步扩展以应用于依赖离散的、基于图的表示法的算法。

结果

此处描述的探索性分析包括选择基础字符串问题,并使用基于CGR的算法对其进行重构。我们发现CGR可以起到后缀树的作用,并模拟复杂的字符串算法,有效地解决精确和近似字符串匹配问题,如查找所有回文和串联重复序列,以及带错配的匹配。这些问题的共同特征是它们将最长公共扩展(LCE)查询用作其过程的子任务,我们证明使用CGR可以在恒定时间内解决这些查询。此外,我们表明CGR可以在拉宾 - 卡普算法中用作滚动哈希函数。

结论

对生物序列的分析依赖于面临越来越多挑战的算法基础,这些挑战既有逻辑方面的(性能),也有分析方面的(缺乏统一的数学框架)。我们发现CGR提供了后者并有望解决前者:CGR映射产生的基于数值的数据结构带来了用于序列分析操作的基于图的数据结构,为各种模式匹配问题提供了统一的分析框架。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f369/3402988/4e63f013a68b/1748-7188-7-10-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f369/3402988/0f2f980d0634/1748-7188-7-10-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f369/3402988/4e63f013a68b/1748-7188-7-10-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f369/3402988/0f2f980d0634/1748-7188-7-10-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f369/3402988/4e63f013a68b/1748-7188-7-10-2.jpg

相似文献

1
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis.通过混沌游戏表示法进行模式匹配:为生物序列分析搭建数字与离散数据结构之间的桥梁。
Algorithms Mol Biol. 2012 May 2;7(1):10. doi: 10.1186/1748-7188-7-10.
2
Mathematical characterization of Chaos Game Representation. New algorithms for nucleotide sequence analysis.混沌游戏表示法的数学特征。核苷酸序列分析的新算法。
J Mol Biol. 1992 Dec 5;228(3):715-9. doi: 10.1016/0022-2836(92)90857-g.
3
A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector.一种蛋白质的新型数值表示:三维混沌博弈表示及其扩展自然向量。
Comput Struct Biotechnol J. 2020 Jul 15;18:1904-1913. doi: 10.1016/j.csbj.2020.07.004. eCollection 2020.
4
Multifarious aspects of the chaos game representation and its applications in biological sequence analysis.混沌游戏表示法的多方面及其在生物序列分析中的应用。
Comput Biol Med. 2022 Dec;151(Pt A):106243. doi: 10.1016/j.compbiomed.2022.106243. Epub 2022 Oct 25.
5
Analysis of genomic sequences by Chaos Game Representation.通过混沌游戏表示法分析基因组序列。
Bioinformatics. 2001 May;17(5):429-37. doi: 10.1093/bioinformatics/17.5.429.
6
Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison.基于混沌游戏表示的DNA序列数值编码及其在相似性比较中的应用
Genomics. 2016 Oct;108(3-4):134-142. doi: 10.1016/j.ygeno.2016.08.002. Epub 2016 Aug 15.
7
Chaos game representation and its applications in bioinformatics.混沌游戏表示法及其在生物信息学中的应用。
Comput Struct Biotechnol J. 2021 Nov 10;19:6263-6271. doi: 10.1016/j.csbj.2021.11.008. eCollection 2021.
8
Fast and accurate genome comparison using genome images: The Extended Natural Vector Method.使用基因组图像进行快速准确的基因组比较:扩展自然向量方法。
Mol Phylogenet Evol. 2019 Dec;141:106633. doi: 10.1016/j.ympev.2019.106633. Epub 2019 Sep 26.
9
Biological sequences as pictures: a generic two dimensional solution for iterated maps.作为图像的生物序列:迭代映射的通用二维解决方案。
BMC Bioinformatics. 2009 Mar 31;10:100. doi: 10.1186/1471-2105-10-100.
10
Analysis of earthquake hypocenter characteristics using chaos game representation.利用混沌博弈表示法分析地震震源特征
Comput Geosci (Bassum). 2023;27(1):143-157. doi: 10.1007/s10596-022-10187-x. Epub 2022 Dec 26.

引用本文的文献

1
Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.基因组数据的高效存储与分析:一种k-mer频率映射与图像表示方法。
Interdiscip Sci. 2024 Oct 21. doi: 10.1007/s12539-024-00659-2.
2
In silico identification of multiple conserved motifs within the control region of Culicidae mitogenomes.基于丝氨酸蛋白酶抑制剂基因构建的系统发育树分析在双翅目昆虫系统发育研究中的应用
Sci Rep. 2022 Dec 19;12(1):21920. doi: 10.1038/s41598-022-26236-5.
3
: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

本文引用的文献

1
The human genome: a multifractal analysis.人类基因组:多重分形分析。
BMC Genomics. 2011 Oct 14;12:506. doi: 10.1186/1471-2164-12-506.
2
Efficient alignment of pyrosequencing reads for re-sequencing applications.用于重测序应用的焦磷酸测序reads 的高效比对。
BMC Bioinformatics. 2011 May 16;12:163. doi: 10.1186/1471-2105-12-163.
3
Using genomic signatures for HIV-1 sub-typing.利用基因组特征进行 HIV-1 亚型分型。
使用自监督通用线性模型快速无比对预测序列比对同一性得分
NAR Genom Bioinform. 2021 Feb 1;3(1):lqab001. doi: 10.1093/nargab/lqab001. eCollection 2021 Mar.
4
A systematic comparison of chloroplast genome assembly tools.系统比较叶绿体基因组组装工具。
Genome Biol. 2020 Sep 28;21(1):254. doi: 10.1186/s13059-020-02153-6.
5
Biocomplexity and Fractality in the Search of Biomarkers of Aging and Pathology: Mitochondrial DNA Profiling of Parkinson's Disease.生物复杂性和分形在衰老和病理学生物标志物搜索中的应用:帕金森病的线粒体 DNA 分析。
Int J Mol Sci. 2020 Mar 4;21(5):1758. doi: 10.3390/ijms21051758.
6
Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage.Read-SpaM:用于低测序覆盖度细菌基因组的无组装和无比对比较。
BMC Bioinformatics. 2019 Dec 17;20(Suppl 20):638. doi: 10.1186/s12859-019-3205-7.
7
Comparison of different annotation tools for characterization of the complete chloroplast genome of Corylus avellana cv Tombul.不同注释工具在鉴定土耳其榛子 cv Tombul 完整叶绿体基因组特征中的比较。
BMC Genomics. 2019 Nov 20;20(1):874. doi: 10.1186/s12864-019-6253-5.
8
Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.Prot-SpaM:基于全蛋白质组序列的快速无比对系统发育重建。
Gigascience. 2019 Mar 1;8(3). doi: 10.1093/gigascience/giy148.
9
Phylogeny reconstruction based on the length distribution of -mismatch common substrings.基于错配公共子串长度分布的系统发育重建。
Algorithms Mol Biol. 2017 Dec 11;12:27. doi: 10.1186/s13015-017-0118-8. eCollection 2017.
10
A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.基于直方图的无比对序列比较统计的调查与评估。
Brief Bioinform. 2019 Jul 19;20(4):1222-1237. doi: 10.1093/bib/bbx161.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S26. doi: 10.1186/1471-2105-11-S1-S26.
4
A web server for interactive and zoomable Chaos Game Representation images.一个用于交互式和可缩放混沌游戏表示图像的网络服务器。
Source Code Biol Med. 2009 Sep 17;4:6. doi: 10.1186/1751-0473-4-6.
5
Chaos game representation of human pallidal spike trains.人类苍白球棘波序列的混沌游戏表示法。
J Biol Phys. 2010 Mar;36(2):197-205. doi: 10.1007/s10867-009-9172-x. Epub 2009 Aug 18.
6
SOAP2: an improved ultrafast tool for short read alignment.SOAP2:一种用于短读序列比对的改进型超快速工具。
Bioinformatics. 2009 Aug 1;25(15):1966-7. doi: 10.1093/bioinformatics/btp336. Epub 2009 Jun 3.
7
Unstable tandem repeats in promoters confer transcriptional evolvability.启动子中的不稳定串联重复序列赋予转录可塑性。
Science. 2009 May 29;324(5931):1213-6. doi: 10.1126/science.1170097.
8
Fast and accurate short read alignment with Burrows-Wheeler transform.使用Burrows-Wheeler变换进行快速准确的短读比对。
Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18.
9
Biological sequences as pictures: a generic two dimensional solution for iterated maps.作为图像的生物序列:迭代映射的通用二维解决方案。
BMC Bioinformatics. 2009 Mar 31;10:100. doi: 10.1186/1471-2105-10-100.
10
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.短DNA序列与人类基因组的超快速且内存高效比对。
Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4.