Meyer Irmtraud M, Durbin Richard
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Nucleic Acids Res. 2004 Feb 4;32(2):776-83. doi: 10.1093/nar/gkh211. Print 2004.
One of the primary tasks in deciphering the functional contents of a newly sequenced genome is the identification of its protein-coding genes. Existing computational methods for gene prediction include ab initio methods, which use the DNA sequence itself as the only source of information; comparative methods, which use multiple genomic sequences; and similarity-based methods, which employ the cDNA or protein sequences of related genes to aid the gene prediction. We present here an algorithm, implemented in a computer program called Projector, which combines comparative and similarity approaches. Projector employs similarity information at the genomic DNA level by directly using known genes annotated on one DNA sequence to predict the corresponding related genes on another DNA sequence. It therefore makes explicit use of the conservation of the exon-intron structure between two related genes, in addition to the similarity of their encoded amino acid sequences. We evaluate the performance of Projector by comparing it with the program Genewise on a test set of 491 pairs of independently confirmed mouse and human genes. Projector is more accurate than Genewise for genes whose proteins are less than 80% identical, and is suitable for use in a combined gene prediction system where other methods identify well-conserved and non-conserved genes, and pseudogenes.