Pavy N, Rombauts S, Déhais P, Mathé C, Ramana D V, Leroy P, Rouzé P
Laboratoire associé de l'INRA, France.
Bioinformatics. 1999 Nov;15(11):887-99. doi: 10.1093/bioinformatics/15.11.887.
Annotating the Arabidopsis thaliana genome remains a problem in terms of both time and quality. To improve the annotation process, we want to choose the most appropriate tools for a computer-assisted annotation platform. This requires evaluating gene prediction programs on Arabidopsis sequences that contain multiple genes.
We have developed AraSet, a data set of contigs of validated genes that enables the evaluation of multi-gene models for the Arabidopsis genome. In addition to the conventional metrics for evaluating gene prediction at the site and exon levels, we introduce new measures for prediction at the protein sequence level and for the evaluation of gene models. This evaluation method is of general interest and could be applied to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for Arabidopsis genomic sequences. Gene modeling could be further improved by combining prediction software.
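As an illustration of the conventional site-level and exon-level metrics mentioned above, the following is a minimal sketch, not the authors' Perl implementation. It assumes the definitions commonly used in gene-prediction benchmarks: sensitivity is the fraction of annotated features that are predicted exactly, and specificity is the fraction of predicted features that are correct. The function names, the coordinate convention, and the treatment of every exon boundary as a "site" are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

# Assumption: an exon is a (start, end) pair of 1-based inclusive coordinates.
Exon = Tuple[int, int]


def sensitivity_specificity(annotated: Set, predicted: Set) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for two sets of features.

    Sensitivity = correct / annotated, specificity = correct / predicted,
    following the convention common in gene-prediction evaluation
    (assumed here, not taken from the paper).
    """
    correct = len(annotated & predicted)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


def exon_level(annotated: Iterable[Exon], predicted: Iterable[Exon]) -> Tuple[float, float]:
    """An exon counts as correct only if both of its boundaries match."""
    return sensitivity_specificity(set(annotated), set(predicted))


def site_level(annotated: Iterable[Exon], predicted: Iterable[Exon]) -> Tuple[float, float]:
    """Each exon start and end is scored as a separate site.

    Simplification: no distinction is made between splice sites and
    translation start/stop sites.
    """
    def sites(exons: Iterable[Exon]) -> Set[Tuple[str, int]]:
        found = set()
        for start, end in exons:
            found.add(("start", start))
            found.add(("end", end))
        return found

    return sensitivity_specificity(sites(annotated), sites(predicted))
```

For example, with annotated exons `[(101, 200), (301, 400), (501, 650)]` and predicted exons `[(101, 200), (301, 390), (701, 750)]`, only the first exon matches exactly, while three of the six annotated boundaries are recovered, so the site-level scores are higher than the exon-level scores.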
The AraSet sequence set, the Perl programs and complementary results and notes are available at http://sphinx.rug.ac.be:8080/biocomp/napav/.