Rogic S, Mackworth A K, Ouellette F B
Computer Science Department, The University of California at Santa Cruz, Santa Cruz 95064, USA.
Genome Res. 2001 May;11(5):817-32. doi: 10.1101/gr.147901.
We present an independent comparative analysis of seven recently developed gene-finding programs: FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF. For evaluation purposes we developed a new, thoroughly filtered, and biologically validated dataset of mammalian genomic sequences that does not overlap with the training sets of the programs analyzed. Our analysis shows that the new generation of programs achieves substantially better results than the programs analyzed in previous studies. The accuracy of the programs was also examined as a function of various sequence and prediction features, such as G + C content of the sequence, length and type of exons, signal type, and score of the exon prediction. This approach pinpoints the strengths and weaknesses of each individual program as well as those of computational gene-finding in general. The dataset used in this analysis (HMR195) as well as the tables with the complete results are available at http://www.cs.ubc.ca/~rogic/evaluation/.