Comparison of Methods of Detection of Exceptional Sequences in Prokaryotic Genomes.

Authors

Rusinov I S, Ershova A S, Karyagina A S, Spirin S A, Alexeevski A V

Affiliation

Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119992, Russia.

Publication

Biochemistry (Mosc). 2018 Feb;83(2):129-139. doi: 10.1134/S0006297918020050.

Abstract

Many proteins must recognize specific DNA sequences in order to function. The number of recognition sites and their distribution along the DNA may be biologically important. For example, the number of restriction sites is often reduced in prokaryotic and phage genomes, which decreases the probability of DNA cleavage by restriction endonucleases. We call a sequence exceptional if its frequency in a genome differs significantly from the frequency predicted by some mathematical model. An exceptional sequence can be either under- or over-represented, depending on how its observed frequency compares with the predicted one. Exceptional sequences may be biologically meaningful, for example, as targets of DNA-binding proteins or as parts of abundant repetitive elements. Several methods are used to predict the frequency of a short sequence in a genome from the actual frequencies of certain of its subsequences. The most popular are based on Markov chain models. However, no rigorous comparison of these methods has previously been performed. We compared three methods for predicting short sequence frequencies: the maximum-order Markov chain model-based method, the method that uses the geometric mean of extended Markovian estimates, and the method that uses the frequencies of all subsequences, including discontiguous ones. We applied them to restriction sites in the complete genomes of 2500 prokaryotic species and demonstrated that the results depend greatly on the method used: lists of the 5% most under-represented sites differed by up to 50%. The method designed by Burge and coauthors in 1992, which uses all subsequences of the sequence, showed higher precision than the other two methods both on prokaryotic genomes and on randomly generated sequences after computational imitation of selective pressure. We propose this method as the first choice for detecting exceptional sequences in prokaryotic genomes.
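To make the first of the three compared approaches concrete, here is a minimal Python sketch of a maximum-order Markov estimate. The abstract does not give the formula, so the sketch assumes the commonly used form: the expected count of a word is the count of its prefix (all but the last letter) times the count of its suffix (all but the first letter), divided by the count of its inner core. The function names are hypothetical, only one strand is scanned, and no significance test is applied, so this should not be read as the authors' implementation.

```python
def count_overlapping(genome: str, word: str) -> int:
    """Count occurrences of `word` in `genome`, allowing overlaps."""
    count = 0
    start = genome.find(word)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also found.
        start = genome.find(word, start + 1)
    return count


def markov_max_order_expected(genome: str, word: str) -> float:
    """Expected count of `word` under a maximum-order Markov model.

    Assumed form (not stated in the abstract):
        E(w1..wk) = N(w1..w[k-1]) * N(w2..wk) / N(w2..w[k-1])
    """
    if len(word) < 3:
        raise ValueError("word must have at least 3 letters")
    prefix_count = count_overlapping(genome, word[:-1])
    suffix_count = count_overlapping(genome, word[1:])
    core_count = count_overlapping(genome, word[1:-1])
    if core_count == 0:
        # The core never occurs, so neither the prefix nor the suffix can.
        return 0.0
    return prefix_count * suffix_count / core_count


def observed_to_expected(genome: str, word: str) -> float:
    """Ratio of observed to expected counts; values below 1 suggest
    under-representation, values above 1 over-representation."""
    expected = markov_max_order_expected(genome, word)
    if expected == 0:
        return float("nan")
    return count_overlapping(genome, word) / expected


if __name__ == "__main__":
    toy_genome = "GAATTCGAATTAATTC"  # toy sequence, not real data
    print(observed_to_expected(toy_genome, "GAATTC"))
```

On a real genome one would typically also count the reverse-complement strand and judge the ratio against a significance threshold; both are omitted here for brevity.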
