Gotoh Osamu
Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501, Japan.
Bioinformatics. 2008 Nov 1;24(21):2438-44. doi: 10.1093/bioinformatics/btn460. Epub 2008 Aug 26.
Finding protein-coding genes in a newly determined genomic sequence is the first step toward understanding the information encoded in the genome. Transcript sequences of homologous genes, when available, can considerably improve the accuracy with which genes and their structures are predicted, compared with prediction without such evidence. Because protein sequences are generally better conserved than nucleotide sequences, remote homologs can serve as templates, which extends the applicability of evidence-based gene recognition methods. However, no tool seems to have been developed so far that simultaneously maps and aligns a number of protein sequences onto a mammalian-sized genomic sequence.
We have extended our computer program Spaln to accept protein sequences, as well as cDNA sequences, as queries. When the query and the target sequences are reasonably similar, e.g. between mammalian orthologs, Spaln runs one to two orders of magnitude faster than conventional approaches that rely on Blast search followed by dynamic-programming-based spliced alignment. Exon-level and gene-level accuracies of Spaln are significantly higher than those obtained by the best available methods of the same type, particularly when the query and the target are distantly related.
Spaln is accessible online for a few species at http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user. The source code is freely available to academic users from the same site.