Santibánez-Koref M, Reich J G
Biomed Biochim Acta. 1986;45(6):737-48.
A statistical model was developed for assessing the suppression or preference of each of the 16 dinucleotides in DNA sequences. It describes the doublet frequencies in randomly "scrambled" DNA sequences by a hypergeometric distribution. The statistical test is sequential: it extracts, one after another, the dinucleotides whose frequencies differ significantly from their expected values. In mammalian DNA, only TA and CG are consistently depressed in all three reading-frame positions. The deviations of the other dinucleotides are either restricted to one frame position or not significant. Whether the coding commitments of the DNA sequences could cause the non-random distribution was also studied. Only in position 1/2 of the reading frame is the frequency behavior of TA adequately explained by the encoded amino acid sequence. It is concluded that TA and CG are avoided wherever possible for reasons that do not reside in the coding function of mammalian DNA sequences.
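As a rough illustration of the "scrambled sequence" baseline described in the abstract, the sketch below compares observed dinucleotide counts with the counts expected if the same bases were randomly permuted. It is not the authors' sequential hypergeometric test: it computes only the expected values, ignores reading-frame position, and assumes a sequence of length at least 2 containing only A, C, G and T. The function name `dinucleotide_obs_exp` is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_obs_exp(seq):
    """Return {dinucleotide: (observed, expected)} for a DNA string.

    'expected' is the mean count over all random permutations of the
    sequence, i.e. with base composition held fixed (an assumption of
    this sketch, not a reproduction of the paper's test).
    """
    seq = seq.upper()
    n = len(seq)
    base_counts = Counter(seq)
    # Count overlapping adjacent pairs: positions (0,1), (1,2), ...
    observed = Counter(seq[i:i + 2] for i in range(n - 1))

    table = {}
    for x, y in product("ACGT", repeat=2):
        nx, ny = base_counts[x], base_counts[y]
        # For one adjacent pair of positions in a random permutation:
        #   P(x then y) = nx * ny / (n * (n - 1))        if x != y
        #   P(x then x) = nx * (nx - 1) / (n * (n - 1))
        # Multiplying by the (n - 1) adjacent pairs cancels one factor.
        expected = nx * (ny - (x == y)) / n
        table[x + y] = (observed[x + y], expected)
    return table
```

An observed count well below the expected value for a dinucleotide such as TA or CG would correspond to the kind of depression the abstract reports, although judging significance requires the distributional model, which this sketch omits.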