Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis.

Authors

Phillips G J, Arnold J, Ivarie R

Publication information

Nucleic Acids Res. 1987 Mar 25;15(6):2611-26. doi: 10.1093/nar/15.6.2611.

Abstract

Several statistical methods were tested for accuracy in predicting the observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. When ranked from highest to lowest abundance, the observed frequencies of oligonucleotides up to six bases in length in E. coli DNA were highly asymmetric. All ordered abundance plots had a wide linear range containing the majority of the oligomers, and the curves deviated sharply from linearity at the high and low ends. In general, values predicted by a Markov chain closely followed the overall shape of the ordered abundance curves. A simple equation was derived by which the frequency of any oligonucleotide longer than four bases in the E. coli genome (or any genome) can be estimated relatively accurately from the nested set of component tri- and tetranucleotides by serial application of a 3rd order Markov chain. The equation yielded a mean ratio of 1.03 ± 0.94 for the observed-to-expected frequencies of the 4,096 hexanucleotides. Hence, the method is a relatively accurate but not perfect predictor of the length in nucleotides between hexanucleotide sites. Higher accuracy can be achieved using a 4th order Markov chain and larger data sets. The high asymmetry in oligonucleotide abundance means that in the E. coli genome of 4.2 × 10^6 bp many relatively short sequences of 7-9 bp are very rare or absent.
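The abstract does not reproduce the equation itself. As a rough illustration, the sketch below uses the standard 3rd order Markov estimator, which is assumed (not confirmed by the abstract) to match the authors' form: the expected count of a word is the product of the counts of its overlapping tetranucleotides divided by the product of the counts of the interior trinucleotides they share. The function names `kmer_counts` and `markov3_expected_count` and the toy sequence are hypothetical, chosen only for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers on a single strand of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov3_expected_count(word, tetra, tri):
    """Expected count of `word` under a 3rd order Markov chain.

    Assumed estimator (standard form, not quoted from the paper):
    product of the counts of the overlapping tetranucleotides in `word`,
    divided by the product of the counts of the interior trinucleotides.
    """
    if len(word) < 5:
        raise ValueError("word must be longer than four bases")
    numerator = 1.0
    for i in range(len(word) - 3):        # every overlapping tetranucleotide
        numerator *= tetra[word[i:i + 4]]
    denominator = 1.0
    for i in range(1, len(word) - 3):     # interior trinucleotides only
        denominator *= tri[word[i:i + 3]]
    # A missing trinucleotide implies the tetranucleotides containing it
    # are missing too, so the expected count is zero.
    return numerator / denominator if denominator else 0.0


seq = "ACGT" * 10                         # toy sequence, not E. coli data
tetra = kmer_counts(seq, 4)
tri = kmer_counts(seq, 3)
print(markov3_expected_count("ACGTAC", tetra, tri))  # 9.0
print(kmer_counts(seq, 6)["ACGTAC"])                 # 9
```

Dividing the observed count by this expected count gives the kind of observed-to-expected ratio the abstract summarizes, and dividing the sequence length by the expected count gives a rough estimate of the mean spacing between occurrences of a site.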

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bb40/340672/2a308e610b8a/nar00250-0221-a.jpg
