Evaluation of gene structure prediction programs.

Author information

Burset M, Guigó R

Affiliation

Departament d'Informàtica Mèdica, Institut Municipal d'Investigació Mèdica (IMIM), Barcelona, E-08003, Spain.

Publication information

Genomics. 1996 Jun 15;34(3):353-67. doi: 10.1006/geno.1996.0298.

DOI:10.1006/geno.1996.0298

PMID:8786136

Abstract

We evaluate a number of computer programs designed to predict the structure of protein coding genes in genomic DNA sequences. Computational gene identification is set to play an increasingly important role in the development of the genome projects, as emphasis turns from mapping to large-scale sequencing. The evaluation presented here serves both to assess the current status of the problem and to identify the most promising approaches to ensure further progress. The programs analyzed were uniformly tested on a large set of vertebrate sequences with simple gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The results indicated that the predictive accuracy of the programs analyzed was lower than originally found. The accuracy was even lower when considering only those sequences that had recently been entered and that did not show any similarity to previously entered sequences. This indicates that the programs are overly dependent on the particularities of the examples they learn from. For most of the programs, accuracy in this test set ranged from 0.60 to 0.70 as measured by the Correlation Coefficient (where 1.0 corresponds to a perfect prediction and 0.0 is the value expected for a random prediction), and the average percentage of exons exactly identified was less than 50%. Only those programs including protein sequence database searches showed substantially greater accuracy. The accuracy of the programs was severely affected by relatively high rates of sequence errors. Since the set on which the programs were tested included only relatively short sequences with simple gene structure, the accuracy of the programs is likely to be even lower when used for large uncharacterized genomic sequences with complex structure. While in such cases, programs currently available may still be of great use in pinpointing the regions likely to contain exons, they are far from being powerful enough to elucidate the genomic structure of such sequences completely.
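The abstract names two kinds of accuracy measure without giving formulas: a nucleotide-level Correlation Coefficient and the fraction of exons identified exactly. The sketch below shows one way such measures could be computed for a single sequence. It assumes the usual gene-finding conventions: each nucleotide is labeled coding or non-coding, sensitivity is TP/(TP+FN), specificity is TP/(TP+FP), and the Correlation Coefficient is the Pearson correlation of the two binary labelings. The function names, the 0-based half-open exon intervals, and the toy coordinates are illustrative assumptions, not taken from the record above.

```python
from math import sqrt


def coding_mask(exons, length):
    """Return a per-nucleotide list of booleans (True = coding).

    Exons are (start, end) pairs, 0-based with end exclusive
    (an assumed convention for this sketch).
    """
    mask = [False] * length
    for start, end in exons:
        for i in range(start, end):
            mask[i] = True
    return mask


def nucleotide_accuracy(actual_exons, predicted_exons, length):
    """Nucleotide-level counts, sensitivity, specificity and correlation."""
    actual = coding_mask(actual_exons, length)
    predicted = coding_mask(predicted_exons, length)

    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))

    nan = float("nan")
    sn = tp / (tp + fn) if (tp + fn) else nan
    # In gene-finding usage "specificity" is often TP / (TP + FP).
    sp = tp / (tp + fp) if (tp + fp) else nan
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Undefined when either labeling is all-coding or all-non-coding.
    cc = (tp * tn - fn * fp) / denom if denom else nan

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sn": sn, "sp": sp, "cc": cc}


def exact_exon_fraction(actual_exons, predicted_exons):
    """Fraction of actual exons predicted with both boundaries exact."""
    predicted = set(predicted_exons)
    hits = sum(1 for exon in actual_exons if exon in predicted)
    return hits / len(actual_exons)


if __name__ == "__main__":
    actual = [(10, 30), (50, 70)]      # toy annotation
    predicted = [(10, 30), (55, 80)]   # toy prediction
    print(nucleotide_accuracy(actual, predicted, length=100))
    print(exact_exon_fraction(actual, predicted))
```

In this toy case one of the two exons is matched exactly, while the second is only partly overlapped. This illustrates how a prediction can score reasonably at the nucleotide level and still miss exact exon boundaries, the pattern the abstract reports.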
