Mrázek J, Kypr J
Institute of Biophysics, Academy of Sciences of the Czech Republic, Brno.
Comput Appl Biosci. 1995 Apr;11(2):195-9. doi: 10.1093/bioinformatics/11.2.195.
We propose a novel, transparent and very simple algorithm for analyzing middle-range correlations in genomic nucleotide sequences. Applying the algorithm to the EMBL Nucleotide Sequence Database shows that, in eukaryotes, all four nucleotides cluster on a scale of several hundred base pairs. In prokaryotes, the clustering is weak but still evident. Within the clusters of a given nucleotide, the other three bases are deficient: A is the most deficient nucleotide in clusters of C, and vice versa, and G is the most deficient nucleotide in clusters of T, and vice versa. The algorithm also detects CG islands extending over 1 kb in vertebrate sequences. In plants, CG islands are shown to be much smaller, if they exist at all. The TA doublet also shows a clustering tendency; other doublets do not cluster. We observe no strong correlation between nucleotides separated by more than 1 kb in genomes.
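The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of one generic way to probe distance-dependent correlations of this kind, not the authors' method. It assumes that clustering can be measured as the ratio between the observed frequency of base `y` at a distance `d` downstream of base `x` and the overall frequency of `y` in the sequence. The function name `distance_enrichment` and its parameters are hypothetical, introduced here for illustration.

```python
from collections import Counter


def distance_enrichment(seq, x, y, distances):
    """Return {d: ratio} where ratio is the observed frequency of base y
    at distance d downstream of base x, divided by the overall frequency
    of y in seq. Ratios above 1 suggest enrichment, below 1 depletion.

    Hypothetical helper for illustration; not the published algorithm.
    """
    seq = seq.upper()
    n = len(seq)
    # Background (expected) frequency of y over the whole sequence.
    background = Counter(seq)[y] / n if n else 0.0

    result = {}
    for d in distances:
        anchors = 0  # positions i holding x that have a partner at i + d
        hits = 0     # of those, how many have y at i + d
        for i in range(n - d):
            if seq[i] == x:
                anchors += 1
                if seq[i + d] == y:
                    hits += 1
        if anchors == 0 or background == 0:
            result[d] = float("nan")  # ratio undefined for this distance
        else:
            result[d] = (hits / anchors) / background
    return result


if __name__ == "__main__":
    # Toy sequence with one block of A followed by one block of C.
    toy = "AAAAAAAACCCCCCCC"
    print(distance_enrichment(toy, "A", "A", [1, 4, 8]))
    print(distance_enrichment(toy, "A", "C", [1, 4, 8]))
```

On real data, one would plot the ratio for `x == y` against `d` and look for the distance at which it decays toward 1. Under this sketch's assumptions, values staying above 1 out to several hundred base pairs would correspond to the clustering described above, while values below 1 for `x != y` would correspond to the reported deficiency of other bases within a cluster.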