Arnold J, Cuticchia A J, Newsome D A, Jennings W W, Ivarie R
Department of Genetics, University of Georgia, Athens 30602.
Nucleic Acids Res. 1988 Jul 25;16(14B):7145-58. doi: 10.1093/nar/16.14.7145.
Here we compare several methods for predicting oligonucleotide frequencies in 392 kb of yeast DNA. As in previous work on E. coli, a relatively simple equation based on tetranucleotide frequencies can be used to predict the frequencies of longer oligonucleotides. For example, the mean of the observed/expected abundances of 4,096 hexamers was 1.00, with a sample standard deviation of 0.18. This simple predictor arises by considering each base on the sense strand of yeast to depend only on the three bases 5' to it (a 3rd-order Markov chain), and it is more accurate in estimating oligonucleotide frequencies than the other statistical methods examined. The equation is useful in predicting restriction enzyme fragment sizes, in selecting restriction enzymes that cut preferentially in coding vs noncoding regions, and in constructing detailed physical maps of whole genomes. When ranked from highest to lowest abundance, the observed frequencies of oligomers of a given length (up to 6 bases) are closely tracked by the abundances predicted by a 3rd- or 4th-order Markov chain. These ordered abundance curves have a power-curve shape, with a broad linear range and a sharp break at the top end of the curve. There is also a strong disparity between the most and least abundant oligomers of a given length, for example a 79-fold variation between the most and least abundant hexamer. The curves reveal a strong dependence of oligomer frequencies on base composition. Unlike in E. coli, there is no sharp downturn at the low end of the curves and hence no class of oligomers that is rare relative to other oligomers of the same length.
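The abstract does not reproduce the predictor itself, so the following is only a minimal sketch of the standard estimate implied by a 3rd-order Markov chain: the expected count of a longer word is the product of the counts of its overlapping tetranucleotides divided by the product of the counts of the trinucleotides they share. The function names `count_kmers` and `predict_count` are hypothetical, and counting is done on a single strand of an arbitrary input string; the paper's exact equation and counting conventions may differ.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count overlapping k-mers on one strand of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def predict_count(word: str, c4: Counter, c3: Counter) -> float:
    """Expected count of `word` (length >= 4) under a 3rd-order Markov chain.

    Assumed form (not quoted from the paper): multiply the counts of every
    overlapping tetranucleotide in the word, then divide by the counts of
    the trinucleotides shared by neighbouring tetranucleotides.
    """
    numerator = 1.0
    for i in range(len(word) - 3):
        numerator *= c4[word[i:i + 4]]
    denominator = 1.0
    for i in range(1, len(word) - 3):
        denominator *= c3[word[i:i + 3]]
    # A zero denominator means a required trinucleotide never occurs,
    # so the word itself cannot occur either.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    seq = "ACGT" * 50  # toy sequence, not yeast data
    c3, c4, c6 = (count_kmers(seq, k) for k in (3, 4, 6))
    word = "ACGTAC"
    print(word, "observed:", c6[word], "expected:", predict_count(word, c4, c3))
```

Dividing the observed count by the expected count for each of the 4,096 hexamers would give the kind of observed/expected ratio the abstract summarizes. Under the same assumptions, dividing the sequence length by the expected count of a six-base recognition site gives a rough mean restriction fragment size.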