Almagor H
J Theor Biol. 1983 Oct 21;104(4):633-45. doi: 10.1016/0022-5193(83)90251-5.
We present a model that treats a DNA sequence as a Markov process. Several workers have suggested that basic biological or chemical features of nucleic acids underlie the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in the DNA of different organisms has been shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies, and suggested that these indices may indicate the molecular constraints existing during gene evolution. Nussinov (1981) showed that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One basic question (the "correlation question") is to what extent the 64 trinucleotide (triplet) frequencies measured in a sequence are determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being the outcomes of a sequence generator. Answering the correlation question then means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, so statistical noise is quite strong. We show that even for a 16,000-nucleotide-long sequence (like that of the human mitochondrial genome) the finite-length effect cannot be neglected. Using the Markov chain model, however, the correlation between doublet and triplet frequencies can be determined even for finite sequences, provided the finite length is properly taken into account. Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
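The correlation question can be illustrated with a minimal sketch. It is not the procedure used in the paper, only the textbook first-order Markov estimate: if each nucleotide depends on its predecessor alone, the triplet frequency is approximately f(xyz) = f(xy) f(yz) / f(y). The function names `kmer_frequencies` and `expected_triplets` are hypothetical, and the sketch deliberately omits the finite-length correction that the abstract says is essential.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequencies of the overlapping k-mers observed in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def expected_triplets(seq):
    """Triplet frequencies predicted from singlet and doublet frequencies.

    Assumes a first-order Markov chain: f(xyz) ~ f(xy) * f(yz) / f(y).
    No correction for finite sequence length is applied here.
    """
    f1 = kmer_frequencies(seq, 1)
    f2 = kmer_frequencies(seq, 2)
    expected = {}
    for x, y, z in product(ALPHABET, repeat=3):
        fy = f1.get(y, 0.0)
        if fy > 0:
            expected[x + y + z] = f2.get(x + y, 0.0) * f2.get(y + z, 0.0) / fy
        else:
            # The middle nucleotide never occurs, so the triplet cannot either.
            expected[x + y + z] = 0.0
    return expected


if __name__ == "__main__":
    seq = "ACGT" * 50  # toy sequence, not real genomic data
    observed = kmer_frequencies(seq, 3)
    predicted = expected_triplets(seq)
    for triplet in sorted(observed):
        print(triplet, round(observed[triplet], 4), round(predicted[triplet], 4))
```

Large gaps between the observed and predicted triplet frequencies would hint at a Markov order above one, but on a short sequence such gaps can also arise from sampling noise alone, which is the finite-length effect the abstract warns about.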