Almeida Jonas S, Vinga Susana
Dept Biometry & Epidemiology, Medical Univ South Carolina, 135 Cannon Street, Suite 303, PO Box 250835, Charleston, SC 29425, USA.
BMC Bioinformatics. 2002;3:6. doi: 10.1186/1471-2105-3-6. Epub 2002 Feb 5.
For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but has not been fully realized. The basic idea is that any sequence of symbols may define a trajectory in a continuous space that conserves all of the sequence's statistical properties. Ideally, such a representation would allow scale-independent sequence analysis, that is, analysis without the context of a fixed memory length. A simple example would consist of being able to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.
We have successfully identified such an iterative function for the bijective mapping psi of discrete sequences into objects of a continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences of arbitrary length with an arbitrary number of unique units, and generates a representation in which map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with 4 unit types (like DNA) as an order-free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/.
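The abstract does not spell out the iteration itself, so the following is only a minimal sketch of the standard CGR rule on which USM is said to build: each symbol moves the current point halfway towards the corner of the unit square assigned to that symbol. The corner assignment, the starting point (0.5, 0.5), and the names `CORNERS`, `cgr_trajectory` and `shared_suffix_estimate` are assumptions made for illustration, not details taken from the paper. The sketch shows why map distance can estimate similarity: every shared symbol halves the distance between two trajectories, so the negative base-2 logarithm of the largest coordinate difference is at least the number of trailing units the two sequences have in common.

```python
import math

# Assumed corner assignment for the four DNA units (illustrative only).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_trajectory(seq, start=(0.5, 0.5)):
    """Return the CGR coordinates reached after each symbol of seq."""
    point = start
    trajectory = []
    for symbol in seq:
        corner = CORNERS[symbol]
        # Move halfway from the current point towards the symbol's corner.
        point = tuple(p + 0.5 * (c - p) for p, c in zip(point, corner))
        trajectory.append(point)
    return trajectory


def shared_suffix_estimate(p, q):
    """Estimate how many preceding units two map positions have in common.

    Uses the largest coordinate difference d and returns -log2(d).
    The value can overestimate the true count when the points happened
    to be close before the shared stretch began.
    """
    d = max(abs(a - b) for a, b in zip(p, q))
    return math.inf if d == 0 else -math.log2(d)


# Two sequences that share their last five units ("TTACA").
a = cgr_trajectory("GATTACA")
b = cgr_trajectory("CCTTACA")
estimate = shared_suffix_estimate(a[-1], b[-1])
print(estimate)  # at least 5 for a shared suffix of five units
```

Extending this sketch to an arbitrary number of unique units would require assigning each unit a corner of a higher-dimensional unit hypercube; how USM does this is not stated in the abstract, so it is left out here.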
USM is shown to enable a statistical mechanics approach to sequence analysis. The scale-independent representation frees sequence analysis from the need to assume a memory length when investigating syntactic rules.