Vinga Susana, Almeida Jonas S
Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, R. Qta. Grande 6, 2780-156 Oeiras, Portugal.
J Theor Biol. 2004 Dec 7;231(3):377-88. doi: 10.1016/j.jtbi.2004.06.030.
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Rényi's quadratic entropy, calculated with the Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences, constitutes a novel technique for evaluating global sequence randomness without some of the drawbacks of the former method. The asymptotic behaviour of this new measure was deduced analytically, and entropies were calculated for several synthetic and experimental biological sequences. The results were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences showed different p-values depending on the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provides additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage. Specialized functions can be obtained from the authors.
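The measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code, and it rests on several assumptions: the corner assignment of the four nucleotides in the unit square, the starting point (0.5, 0.5), an isotropic Gaussian Parzen kernel of width `sigma`, integration over the whole plane rather than the unit square, and the hypothetical function names `cgr_points` and `renyi_quadratic_entropy`. With a Gaussian kernel, the integral of the squared Parzen density has a closed form: the average over all point pairs of a Gaussian with doubled variance evaluated at the pairwise difference.

```python
import numpy as np

# Assumed corner assignment for the CGR unit square (a common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Map a DNA string to CGR coordinates: each point lies halfway
    between the previous point and the corner of the current base."""
    pts = np.empty((len(seq), 2))
    current = np.asarray(start, dtype=float)
    for i, base in enumerate(seq.upper()):
        current = 0.5 * (current + np.asarray(CORNERS[base]))
        pts[i] = current
    return pts


def renyi_quadratic_entropy(points, sigma):
    """Renyi quadratic entropy H2 = -ln( integral of f(x)^2 dx ), where f is
    a Gaussian Parzen window estimate with per-axis standard deviation sigma.
    The integral equals the mean over all pairs (i, j) of a Gaussian with
    variance 2 * sigma**2 evaluated at points[i] - points[j]."""
    n, d = points.shape
    diffs = points[:, None, :] - points[None, :, :]   # (n, n, d) pairwise differences
    sq_dist = np.sum(diffs ** 2, axis=-1)             # (n, n) squared distances
    var = 2.0 * sigma ** 2                            # variance of the convolved kernel
    kernel = np.exp(-sq_dist / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    return -np.log(kernel.sum() / n ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    random_seq = "".join(rng.choice(list("ACGT"), size=200))
    repeat_seq = "A" * 200
    for name, seq in [("random", random_seq), ("repetitive", repeat_seq)]:
        h2 = renyi_quadratic_entropy(cgr_points(seq), sigma=0.05)
        print(name, h2)
```

In this sketch a highly repetitive sequence concentrates its CGR points in a small region and therefore yields a lower entropy than a uniformly random sequence of the same length. Varying `sigma` plays the role of the kernel resolution mentioned in the abstract; comparing a sequence's value against values from many simulated random sequences would give an empirical p-value under the null model.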