Brendel Volker, Xing Liqun, Zhu Wei
Department of Genetics, Development and Cell Biology, Iowa State University, 2112 Molecular Biology Building, Ames, IA 50011-3260, USA.
Bioinformatics. 2004 May 1;20(7):1157-69. doi: 10.1093/bioinformatics/bth058. Epub 2004 Feb 5.
Accurate gene structure annotation is a challenging computational problem in genomics. The best results are achieved with spliced alignment of full-length cDNAs or multiple expressed sequence tags (ESTs) with sufficient overlap to cover the entire gene. For most species, cDNA and EST collections are far from comprehensive. We sought to overcome this bottleneck by exploring the possibility of using combined EST resources from fairly diverged species that still share a common gene space. Previous spliced alignment tools were found inadequate for this task because they rely on very high sequence similarity between the ESTs and the genomic DNA.
We have developed a computer program, GeneSeqer, which is capable of aligning thousands of ESTs with a long genomic sequence in a reasonable amount of time. The algorithm is uniquely designed to tolerate a high percentage of mismatches and insertions or deletions in the EST relative to the genomic template. This feature allows use of non-cognate ESTs for gene structure prediction, including ESTs derived from duplicated genes and homologous genes from related species. The increased gene prediction sensitivity results in part from novel splice site prediction models that are also available as a stand-alone splice site prediction tool. We assessed GeneSeqer performance relative to a standard Arabidopsis thaliana gene set and demonstrate its utility for plant genome annotation. In particular, we propose that this method provides a timely tool for the annotation of the rice genome, using abundant ESTs from other cereals and plants.
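The idea of a spliced alignment that tolerates mismatches and indels can be illustrated with a small dynamic-programming sketch. This is a toy illustration only, not GeneSeqer's algorithm or scoring scheme: the function name `spliced_alignment_score`, the score values, the minimum intron length, and the strict GT...AG intron rule are all assumptions made for the example.

```python
def spliced_alignment_score(genomic, est, match=2, mismatch=-1, gap=-2,
                            intron=-5, min_intron=4):
    """Best score for aligning the whole EST to some region of the genomic sequence.

    Hypothetical toy scoring (not GeneSeqer's): substitutions and single-base
    gaps are penalized per base, while an intron is a genomic gap of any
    length >= min_intron that starts with GT and ends with AG and costs a
    flat penalty.
    """
    n, m = len(genomic), len(est)
    neg_inf = float("-inf")
    # score[i][j]: best score aligning est[:j] to a genomic region ending at i
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = 0            # genomic sequence before the gene is free
    for j in range(1, m + 1):
        score[0][j] = j * gap      # EST bases with no genomic counterpart

    for i in range(1, n + 1):
        ends_with_ag = i >= 2 and genomic[i - 2:i] == "AG"
        for j in range(1, m + 1):
            sub = match if genomic[i - 1] == est[j - 1] else mismatch
            best = max(score[i - 1][j - 1] + sub,   # match or mismatch
                       score[i - 1][j] + gap,       # genomic base missing from EST
                       score[i][j - 1] + gap)       # extra base in EST
            if ends_with_ag:
                # Close an intron genomic[d:i] that began at a GT donor site.
                for d in range(0, i - min_intron + 1):
                    if genomic[d:d + 2] == "GT":
                        best = max(best, score[d][j] + intron)
            score[i][j] = best

    # Genomic sequence after the gene is free as well.
    return max(score[i][m] for i in range(n + 1))


# Two exons (ATGGCA, GATTACA) separated by a GT...AG intron.
genomic = "CCC" + "ATGGCA" + "GTAAGTTTTCAG" + "GATTACA" + "TTT"
print(spliced_alignment_score(genomic, "ATGGCAGATTACA"))  # 13 matches + 1 intron = 21
```

Because mismatches and short gaps cost only a few points each, an EST from a related gene that differs at a handful of positions still aligns across the intron rather than being rejected, which is the property the abstract emphasizes. The cubic-time intron search here is for clarity and would not scale to thousands of ESTs.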
The source code is available for download at http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html. Web servers for Arabidopsis and other plant species are accessible at http://www.plantgdb.org/cgi-bin/AtGeneSeqer.cgi and http://www.plantgdb.org/cgi-bin/GeneSeqer.cgi, respectively. For non-plant species, use http://bioinformatics.iastate.edu/cgi-bin/gs.cgi. The splice site prediction tool (SplicePredictor) is distributed with the GeneSeqer code. A SplicePredictor web server is available at http://bioinformatics.iastate.edu/cgi-bin/sp.cgi.