Vinga Susana, Almeida Jonas S
Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 Lisboa, Portugal.
BMC Bioinformatics. 2007 Oct 16;8:393. doi: 10.1186/1471-2105-8-393.
In a recent report the authors presented a new measure of continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition explored therein was based on the Rényi entropy of a probability density function (pdf) estimated with the Parzen window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). Subsequent work proposed a fractal pdf kernel as a more exact solution for the iterated map representation. This report extends the concepts of continuous entropy by defining DNA sequence entropic profiles, using the new pdf estimations to refine the density estimation of motifs.
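As an illustration of the earlier measure, the following is a minimal sketch, not the authors' Matlab code. It maps a DNA string to CGR points and computes the quadratic (order 2) Rényi entropy of a Gaussian Parzen window estimate. The corner assignment, the starting point, the function names, and the kernel variance `sigma2` are assumptions chosen for the example, and the integral is taken over the whole plane rather than restricted to the unit square.

```python
import math

# Assumed corner assignment: any fixed mapping of the four nucleotides
# to the corners of the unit square defines an equivalent map.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Chaos game iteration: move halfway toward the corner of each symbol."""
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points, sigma2):
    """H2 = -ln(integral of f^2) for a Gaussian Parzen estimate f.

    For isotropic 2-D Gaussian kernels of variance sigma2, the integral of
    the product of two kernels is a Gaussian of variance 2*sigma2 evaluated
    at the difference of their centres, so the integral becomes a double sum.
    """
    n = len(points)
    var = 2.0 * sigma2
    norm = 1.0 / (2.0 * math.pi * var)
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (2.0 * var))
    return -math.log(total / (n * n))


# A repetitive sequence concentrates its CGR points, which lowers the entropy.
h_repeat = renyi_quadratic_entropy(cgr_points("A" * 40), sigma2=0.01)
h_mixed = renyi_quadratic_entropy(cgr_points("ACGT" * 10), sigma2=0.01)
```

Under these assumptions, `h_repeat` comes out lower than `h_mixed`, because the points of the homopolymer collapse toward a single corner while the mixed sequence spreads over the four quadrants.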
The new methodology yields two results. First, it shows that the entropic profiles are directly related to the statistical significance of motifs, allowing the study of under- and over-representation of segments. Second, by spanning the parameters of the kernel function, it is possible to extract important information about the scale of each conserved DNA region. The computational applications, developed in Matlab m-code, together with the corresponding binary executables, additional material, and examples, are publicly available at http://kdbio.inesc-id.pt/~svinga/ep/.
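The idea of relating a position to motif over- or under-representation across scales can be sketched with plain word counts. This is a simplified stand-in, not the fractal kernel or the entropic profile defined in the paper. For a position `i`, it compares the observed count of each word of length `L` ending at `i` with the count expected under a uniform, independent nucleotide model. The function name `suffix_enrichment` and the uniform background are assumptions made for this example.

```python
from collections import Counter


def suffix_enrichment(seq, i, max_len):
    """Observed/expected ratio for the words of length 1..max_len ending at i.

    i is a 0-based position; the word of length L covers seq[i-L+1 .. i].
    The expected count assumes uniform, independent nucleotides, which is
    an illustrative background model only.
    """
    ratios = {}
    for length in range(1, max_len + 1):
        if i - length + 1 < 0:
            break  # not enough preceding symbols for this scale
        n_windows = len(seq) - length + 1
        counts = Counter(seq[k:k + length] for k in range(n_windows))
        word = seq[i - length + 1:i + 1]
        expected = n_windows / 4 ** length
        ratios[length] = counts[word] / expected
    return ratios


# Position 3 closes the first "ACGT" repeat in this toy sequence.
profile = suffix_enrichment("ACGT" * 5, i=3, max_len=4)
```

A ratio above 1 suggests over-representation at that scale and a ratio below 1 suggests under-representation. Scanning `length` plays a role loosely analogous to spanning the kernel parameters, in that it indicates the word length at which a region stands out.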
The ability to detect local conservation from a scale-independent representation of symbolic sequences is particularly relevant for biological applications where conserved motifs occur in multiple, overlapping scales, with significant future applications in the recognition of foreign genomic material and inference of motif structures.