Guigó R
Institut Municipal d'Investigació Mèdica, Departament d'Estadística, Universitat de Barcelona, Spain.
J Comput Biol. 1998 Winter;5(4):681-702. doi: 10.1089/cmb.1998.5.681.
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from this pool as sequences of nonoverlapping, frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. For additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm presented here relies on the simple fact that such a highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. In addition, the algorithm described here does not assume an underlying gene structure model. Instead, valid gene structures are defined externally in the so-called Gene Model. The Gene Model simply specifies which gene features are allowed immediately upstream of which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular, it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
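The store-and-update idea described in the abstract can be illustrated with a minimal Python sketch. It is not the published implementation: the names `Exon` and `best_gene` are hypothetical, reading-frame compatibility and the Gene Model are deliberately left out, and two exons are assumed to be compatible whenever the donor of the first lies strictly upstream of the acceptor of the second. The exons are sorted inside the function for convenience; if both orderings were already available, the scan itself would visit each exon a constant number of times.

```python
from typing import List, NamedTuple, Optional, Tuple


class Exon(NamedTuple):
    acceptor: int  # first position of the exon
    donor: int     # last position of the exon (acceptor <= donor)
    score: float


def best_gene(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest additive score and the exon chain achieving it.

    Simplifying assumption: exon a may precede exon b iff a.donor < b.acceptor.
    """
    n = len(exons)
    if n == 0:
        return 0.0, []

    # Two views of the same exon set: by acceptor and by donor position.
    by_acceptor = sorted(range(n), key=lambda i: exons[i].acceptor)
    by_donor = sorted(range(n), key=lambda i: exons[i].donor)

    best_ending = [0.0] * n                  # best gene score ending at exon i
    prev: List[Optional[int]] = [None] * n   # back-pointer for traceback
    stored_score = 0.0                       # best gene ending upstream so far
    stored_idx: Optional[int] = None         # None means "start a new gene"
    j = 0                                    # cursor into by_donor

    for i in by_acceptor:
        # Advance the donor cursor past every exon ending before exon i starts,
        # updating the single stored best gene instead of re-scanning them all.
        while j < n and exons[by_donor[j]].donor < exons[i].acceptor:
            k = by_donor[j]
            if best_ending[k] > stored_score:
                stored_score, stored_idx = best_ending[k], k
            j += 1
        best_ending[i] = exons[i].score + stored_score
        prev[i] = stored_idx

    # Trace back from the best-scoring final exon.
    cur: Optional[int] = max(range(n), key=lambda i: best_ending[i])
    total = best_ending[cur]
    chain: List[Exon] = []
    while cur is not None:
        chain.append(exons[cur])
        cur = prev[cur]
    chain.reverse()
    return total, chain


if __name__ == "__main__":
    pool = [
        Exon(1, 10, 5.0),
        Exon(5, 20, 8.0),
        Exon(12, 30, 4.0),
        Exon(25, 40, 6.0),
        Exon(35, 50, 3.0),
    ]
    score, chain = best_gene(pool)
    print(score, chain)  # 14.0, using the exons at 5-20 and 25-40
```

Any exon consumed by the donor cursor has an acceptor upstream of the current exon, so its best score has already been computed when it is read. That ordering is what lets a single stored value replace the inner loop of the quadratic approach.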