Leung M Y, Marsh G M, Speed T P
Division of Mathematics and Statistics, University of Texas at San Antonio 78249, USA.
J Comput Biol. 1996 Fall;3(3):345-60. doi: 10.1089/cmb.1996.3.345.
The relative abundance and rarity of DNA words have been recognized in previous biological studies to have implications for the regulation, repair, and evolutionary mechanisms of a genome. In this paper, we review several different measures of abundance and rarity of DNA words, including z-scores, representation ratios, and cross-ratios, that have appeared in the recent literature, and examine the concordance among them using the human cytomegalovirus genome sequence. We then rank all words of length k = 2, ..., 5 of seven herpesvirus genomes according to their abundance, as measured by one of the z-scores based upon a stationary Markov model of order k-2. Using a simple metric on the ranks of 2-words of the seven herpesvirus sequences, we construct an evolutionary tree. Several 3-words are observed to be consistently over- or underrepresented in all seven herpesviruses. Furthermore, clusters of some of the most over- and underrepresented 4- and 5-words in the genomes are identified with functional sites such as the origins of replication and regulatory signals of individual viruses.
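As an illustration of the kind of measure the abstract refers to, the following is a minimal sketch of a representation ratio and a z-score for k-words under a Markov model of order k-2. It rests on assumptions not stated in the abstract: the expected count is taken as N(w1..wk-1) N(w2..wk) / N(w2..wk-1), the variance uses a commonly cited large-sample approximation, and the function names `count_words` and `markov_word_stats` are hypothetical. The paper reviews several z-scores, and this sketch is not necessarily the one it uses for ranking.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count overlapping k-words in seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov_word_stats(seq, k):
    """Return {word: (observed, expected, ratio, z)} for each observed k-word.

    Expected counts come from an order k-2 Markov estimate (an assumption of
    this sketch): E = N(prefix) * N(suffix) / N(middle), where prefix and
    suffix are the two (k-1)-subwords and middle is the shared (k-2)-subword.
    For k = 2 the middle word is empty, so the sequence length is used.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n_k = count_words(seq, k)
    n_km1 = count_words(seq, k - 1)
    n_km2 = count_words(seq, k - 2) if k > 2 else None

    stats = {}
    for word, observed in n_k.items():
        left = n_km1[word[:-1]]    # count of the first k-1 letters
        right = n_km1[word[1:]]    # count of the last k-1 letters
        mid = n_km2[word[1:-1]] if k > 2 else len(seq)
        expected = left * right / mid
        # Approximate variance (assumed form, not taken from the paper).
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        z = (observed - expected) / sqrt(variance) if variance > 0 else float("nan")
        stats[word] = (observed, expected, observed / expected, z)
    return stats


def rank_words(stats):
    """Rank words from most overrepresented to most underrepresented by z."""
    scored = [(w, s[3]) for w, s in stats.items() if s[3] == s[3]]  # drop NaN
    return [w for w, _ in sorted(scored, key=lambda t: t[1], reverse=True)]
```

Calling `rank_words(markov_word_stats(genome, 2))` on a genome string would give an ordering of the observed 2-words by this z-score. Words absent from the sequence are not reported, and both strands would need to be handled separately if strand symmetry matters.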