Salamov A A, Solovyev V V
The Sanger Centre, Hinxton, Cambridge CB10 1SA, UK.
Genome Res. 2000 Apr;10(4):516-22. doi: 10.1101/gr.10.4.516.
Ab initio gene identification in the genomic sequence of Drosophila melanogaster was performed using (human gene predictor) and Fgenesh programs, which have organism-specific parameters for human, Drosophila, plants, yeast, and nematode. We did not use cDNA/EST information in most predictions, in order to model the realistic situation of finding new genes, where complete cDNA information is often absent or based on very small partial fragments. We investigated the accuracy of gene prediction at different levels and designed several schemes to predict an unambiguous set of genes (annotation CGG1), a set of reliable exons (annotation CGG2), and the most complete set of exons (annotation CGG3). For 49 genes whose protein products have clear homologs in protein databases, predictions were recomputed with the Fgenesh+ program. The first annotation serves as the optimal computational description of a new sequence to be presented in a database. Reliable exons from the second annotation serve as good candidates for selecting PCR primers for experimental verification of gene structure. Our results show that we can identify approximately 90% of coding nucleotides with 20% false positives. At the exon level, we accurately predicted 65% of exons (89% including overlapping exons) with 49% false positives. To optimize prediction accuracy, we designed a gene identification scheme using Fgenesh that provided sensitivity (Sn) = 98% and specificity (Sp) = 86% at the base level, Sn = 81% (97% including overlapping exons) and Sp = 58% at the exon level, and Sn = 72% and Sp = 39% at the gene level (estimating sensitivity on the std1 set and specificity on the std3 set). In general, these results show that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives. However, exact gene prediction (especially at the gene level) still requires further improvement of gene prediction algorithms. The program was also tested on predicting genes of human Chromosome 22 (the latest version of Fgenesh can analyze the whole chromosome sequence). This analysis demonstrated that 88% of manually annotated exons in Chromosome 22 were among the ab initio predicted exons. The suite of gene identification programs is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/gf.html.
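The abstract reports sensitivity (Sn) and specificity (Sp) at the base and exon levels without stating the formulas. The sketch below illustrates one common way such figures are computed in the gene-finding literature, assuming Sn = TP / (number of annotated items) and Sp = TP / (number of predicted items), so that 1 - Sp is the false-positive fraction among predictions. The function names, the inclusive-coordinate exon representation, and the toy coordinates are hypothetical and are not taken from the paper or from Fgenesh.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: an exon is (start, end), both inclusive.
Exon = Tuple[int, int]


def covered_bases(exons: Iterable[Exon]) -> Set[int]:
    """Return the set of nucleotide positions covered by the exons."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def sn_sp(true_positives: int, n_annotated: int, n_predicted: int) -> Tuple[float, float]:
    """Assumed convention: Sn = TP / annotated, Sp = TP / predicted."""
    sn = true_positives / n_annotated if n_annotated else 0.0
    sp = true_positives / n_predicted if n_predicted else 0.0
    return sn, sp


def base_level(annotated: Iterable[Exon], predicted: Iterable[Exon]) -> Tuple[float, float]:
    """Sn and Sp counted over individual coding nucleotides."""
    a, p = covered_bases(annotated), covered_bases(predicted)
    return sn_sp(len(a & p), len(a), len(p))


def exon_level(annotated: Iterable[Exon], predicted: Iterable[Exon]) -> Tuple[float, float]:
    """Sn and Sp counting only exons whose both boundaries match exactly."""
    a, p = set(annotated), set(predicted)
    return sn_sp(len(a & p), len(a), len(p))


def exon_overlap_sn(annotated: Iterable[Exon], predicted: Iterable[Exon]) -> float:
    """Sn when an annotated exon counts as found if any prediction overlaps it."""
    a, p = set(annotated), list(predicted)
    hits = sum(1 for s, e in a if any(ps <= e and s <= pe for ps, pe in p))
    return hits / len(a) if a else 0.0


# Toy data (invented for illustration only).
annotated = [(100, 200), (300, 400)]
predicted = [(100, 200), (310, 400), (500, 550)]

base_sn, base_sp = base_level(annotated, predicted)
exon_sn, exon_sp = exon_level(annotated, predicted)
overlap_sn = exon_overlap_sn(annotated, predicted)
print(f"base: Sn={base_sn:.3f} Sp={base_sp:.3f}")
print(f"exon: Sn={exon_sn:.3f} Sp={exon_sp:.3f} Sn(overlap)={overlap_sn:.3f}")
```

In the toy data, the second predicted exon misses the annotated start by ten bases, so it counts toward base-level accuracy and overlap-based sensitivity but not toward exact exon-level accuracy. This mirrors the gap the abstract reports between exact and overlapping exon figures.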