Kim Nak-Kyeong, Tharakaraman Kannan, Spouge John L
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics. 2006 Dec 1;22(23):2870-5. doi: 10.1093/bioinformatics/btl528. Epub 2006 Oct 26.
Many computational methods for identifying regulatory elements use a likelihood ratio between motif and background models. Often, the methods use a background model of independent bases. At least two different Markov background models have been proposed with the aim of increasing the accuracy of predicting regulatory elements. Both Markov background models suffer theoretical drawbacks, so this article develops a third, context-dependent Markov background model from fundamental statistical principles.
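As a rough illustration of the likelihood ratio described above, the following Python sketch scores a candidate site under a position-specific motif model against a fixed-order Markov background. It is a generic sketch under stated assumptions, not the context-dependent model developed in the article: the function names, the pseudocount smoothing, and the choice to condition each base on the bases preceding it in the sequence are all illustrative.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from the input sequences.

    Hypothetical helper: the pseudocount smoothing is an assumption,
    not something specified in the abstract.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(BASES)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood_ratio(seq, start, pwm, background, order=3):
    """Log of (motif likelihood / background likelihood) for one window.

    `pwm` is a list of dicts mapping each base to its probability at that
    motif position. Each background term is conditioned on the `order`
    bases that precede the position in the sequence.
    """
    if start < order:
        raise ValueError("window needs `order` bases of left context")
    score = 0.0
    for j, column in enumerate(pwm):
        pos = start + j
        base = seq[pos]
        context = seq[pos - order:pos]
        score += math.log(column[base] / background(context, base))
    return score
```

Setting `order=0` makes every context the empty string, which reduces the sketch to the background model of independent bases, so the same function can be used to compare the two kinds of background.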
Datasets containing known regulatory elements in eukaryotes provided a basis for comparing the predictive accuracies of the different background models. Non-parametric statistical tests indicated that Markov models of order 3 constituted a statistically significant improvement over the background model of independent bases. Our model performed slightly better than the previous Markov background models. We also found that for discriminating between the predictive accuracies of competing background models, the correlation coefficient is a more sensitive measure than the performance coefficient.
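The abstract does not define the two accuracy measures. The sketch below assumes the nucleotide-level definitions commonly used in motif-finder benchmarks: the performance coefficient as the overlap between known and predicted site positions divided by their union, and the correlation coefficient as the Pearson (Matthews) correlation between the known and predicted labelings of each position. The function names are hypothetical.

```python
import math

def site_overlap_counts(known, predicted, total_positions):
    """Count true/false positives and negatives over sequence positions.

    `known` and `predicted` are iterables of position indices covered by
    known and predicted sites; `total_positions` is the dataset length.
    """
    known, predicted = set(known), set(predicted)
    tp = len(known & predicted)
    fp = len(predicted - known)
    fn = len(known - predicted)
    tn = total_positions - tp - fp - fn
    return tp, fp, fn, tn

def performance_coefficient(tp, fp, fn):
    """Assumed definition: |known AND predicted| / |known OR predicted|."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def correlation_coefficient(tp, fp, fn, tn):
    """Assumed definition: nucleotide-level Matthews correlation."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0
```

Under these assumed definitions, the correlation coefficient also depends on the true negatives, whereas the performance coefficient ignores them.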
Our C++ program is available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/2006-07-19