Fofanov Yuriy, Luo Yi, Katili Charles, Wang Jim, Belosludtsev Yuri, Powdrill Thomas, Belapurkar Chetan, Fofanov Viacheslav, Li Tong-Bin, Chumakov Sergey, Pettitt B Montgomery
Department of Computer Science, University of Houston, TX 77204-3010, USA.
Bioinformatics. 2004 Oct 12;20(15):2421-8. doi: 10.1093/bioinformatics/bth266. Epub 2004 Apr 15.
Analysis of the statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, with applications in the diagnosis of infectious diseases, environmental studies, etc.
We present results of a correlation analysis of the distributions of the presence/absence of short nucleotide subsequences of different lengths ('n-mers', n = 5-20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20mers in their genomes is not correlated. For close biological relatives, some correlation in the presence of n-mers in this range appears, but it is not as strong as expected. The suppressed correlations among the n-mers present in different genomes make it possible to use random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species, including human, with a low probability of error.
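To make the distinction between frequency of presence and frequency of appearance concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the abstract does not say which correlation measure was used, so the Pearson (phi) coefficient over binary presence indicators is an assumption, as are the function names `present_nmers` and `presence_correlation`, the single-strand treatment (no reverse complements), and the choice to skip windows containing non-ACGT characters.

```python
from math import sqrt

ALPHABET = set("ACGT")


def present_nmers(seq, n):
    """Return the set of distinct n-mers occurring at least once in seq.

    Each n-mer is recorded once however often it occurs, so this captures
    presence/absence rather than the number of appearances.
    Windows containing characters other than A, C, G, T are skipped (assumption).
    """
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if set(window) <= ALPHABET:
            found.add(window)
    return found


def presence_correlation(seq_a, seq_b, n):
    """Phi coefficient between the presence indicators of two sequences.

    The indicators are taken over all 4**n possible n-mers. This measure
    is an assumed stand-in for the correlation analysis in the abstract.
    """
    a = present_nmers(seq_a, n)
    b = present_nmers(seq_b, n)
    total = 4 ** n
    p_a = len(a) / total        # fraction of all n-mers present in A
    p_b = len(b) / total        # fraction of all n-mers present in B
    p_ab = len(a & b) / total   # fraction present in both
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        # Undefined when a sequence contains none or all of the possible n-mers.
        raise ValueError("correlation undefined: presence indicator is constant")
    return (p_ab - p_a * p_b) / denom
```

A value near zero means that knowing an n-mer is present in one sequence says little about its presence in the other, which is the property the abstract relies on when it proposes random n-mer sets for discriminating genomes.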