Fickett J W, Tung C S
Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM 87545.
Nucleic Acids Res. 1992 Dec 25;20(24):6441-50. doi: 10.1093/nar/20.24.6441.
A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures: functions that produce, given any sample window of sequence, a number or vector intended to measure the degree to which the sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure, counting oligomers, is more effective than any of the more sophisticated measures. Different measures contain different information. However, there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
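As an illustration of what a coding measure of the oligomer-counting kind might look like, the following minimal sketch maps a sequence window to a vector of k-mer frequencies. It is not the authors' implementation: the function names `oligomer_counts` and `oligomer_vector`, the default `k = 6`, the use of overlapping oligomers, and the handling of ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def oligomer_counts(window: str, k: int = 6) -> Counter:
    """Count overlapping k-mers in a sequence window.

    Hypothetical helper: k-mers containing characters other than
    A, C, G, T (for example N) are skipped.
    """
    window = window.upper()
    counts = Counter()
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def oligomer_vector(window: str, k: int = 6) -> list:
    """Return the 4**k relative k-mer frequencies in lexicographic order.

    This is the 'vector' form of the measure: one entry per possible
    oligomer. An all-zero vector is returned if no valid k-mer is found.
    """
    counts = oligomer_counts(window, k)
    total = sum(counts.values())
    return [
        counts["".join(p)] / total if total else 0.0
        for p in product(ALPHABET, repeat=k)
    ]


if __name__ == "__main__":
    # Trinucleotide counts for a short toy window.
    print(oligomer_counts("ATGATG", k=3))
```

In a recognition setting, such a vector would typically be compared against frequencies estimated from known coding and noncoding training windows; that comparison step is left out here because the abstract does not specify it.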