Audic S, Claverie J M
Institute of Structural Biology and Microbiology, CNRS, Marseille, France.
Comput Chem. 1997;21(4):223-7. doi: 10.1016/s0097-8485(96)00040-x.
Eukaryotic promoters are among the most important functional domains yet to be characterized in a satisfactory manner in genomic sequences. Most current detection methods rely on the recognition of individual transcription elements using position-weight matrices (PWM) or consensus sequences. Here, we study a simple promoter detection algorithm based on Markov transition matrices built from sequences upstream of proven transcription initiation sites. Its performance was evaluated on the training set and on a test set of promoter-containing sequences. The results on the training set are surprisingly good, given that the algorithm does not incorporate any specific knowledge about promoters. Yet the program exhibits the pathological behaviour typical of all training-set-based methods: a significant decline in performance when confronted with previously unseen sequences. Thus, the Markov algorithm, like the others presently available, does not truly capture the essence of eukaryotic promoters. A detection program based on a Markov model is likely to be blind to categories of promoters without close representatives in the training set.
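As an illustration of the general idea only, a Markov transition matrix detector can be sketched as follows. The abstract does not state the Markov order, the smoothing, the background model, or the scoring rule, so everything here is an assumption: a first-order chain, add-one pseudocounts, a background matrix trained on non-promoter sequence, and a log-odds score. The function names and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def transition_matrix(seqs, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Assumption: add-`pseudocount` smoothing, so no transition has probability 0.
    """
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]
    counts = {c: {b: pseudocount for b in ALPHABET} for c in contexts}
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            # Skip windows containing ambiguous symbols such as N.
            if ctx in counts and nxt in counts[ctx]:
                counts[ctx][nxt] += 1
    matrix = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        matrix[ctx] = {b: n / total for b, n in row.items()}
    return matrix


def log_odds(seq, promoter, background, order=1):
    """Sum of log(P_promoter / P_background) over all transitions in `seq`.

    Assumption: a positive score is read as "more promoter-like".
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if ctx in promoter and nxt in promoter[ctx]:
            score += math.log(promoter[ctx][nxt] / background[ctx][nxt])
    return score


# Toy data, invented for this sketch (not real promoter sequences).
promoter_train = ["TATAAATATATAAA", "TTATATAAGTATAA"]
background_train = ["GCGCGGCCGCGCGC", "CCGGCGCGGGCCGC"]

promoter_model = transition_matrix(promoter_train)
background_model = transition_matrix(background_train)

seen_like = log_odds("TATATAAA", promoter_model, background_model)
unseen_like = log_odds("GCGCGCGC", promoter_model, background_model)
```

In this toy setting a query resembling the training promoters scores above zero, while a GC-rich query with no close counterpart in the promoter training set scores below zero. That mirrors, in miniature, the limitation the abstract describes: the model can only reward transition patterns it has already seen.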