Keller Oliver, Odronitz Florian, Stanke Mario, Kollmar Martin, Waack Stephan
Universität Göttingen, Institut für Informatik, Lotzestr. 16-18, 37083 Göttingen, Germany.
BMC Bioinformatics. 2008 Jun 13;9:278. doi: 10.1186/1471-2105-9-278.
Many types of analysis require data about gene structure and the locations of non-coding regions of genes. Although a vast amount of genomic sequence data is available, precise annotation of genes is lagging behind. Finding the gene that corresponds to a given protein sequence by means of conventional tools is error-prone and cannot be completed without manual inspection, which is time-consuming and requires considerable experience.
Scipio is a tool based on the alignment program BLAT that determines the precise gene structure given a protein sequence and a genome sequence. It identifies intron-exon borders and splice sites, and it is able to cope with sequencing errors and with genes spanning several contigs in genomes that have not yet been assembled into supercontigs or chromosomes. Instead of producing a set of hits with varying confidence, Scipio gives the user a coherent summary of the locations on the genome that code for the query protein. The output contains information about discrepancies that may result from sequencing errors. Scipio has also been used successfully to find homologous genes in closely related species. Scipio was tested with 979 protein queries against 16 arthropod genomes (intra-species search). For cross-species annotation, Scipio was used to annotate 40 genes from Homo sapiens in the primates Pongo pygmaeus abelii and Callithrix jacchus. The prediction quality of Scipio was tested in a comparative study against that of BLAT and the well-established program Exonerate.
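To make the notion of intron-exon borders and splice sites concrete, the following minimal Python sketch checks whether the introns implied by a list of exon coordinates begin with GT and end with AG, the canonical splice-site dinucleotides. It is an illustration only and not Scipio's implementation: the function name `intron_splice_sites`, the 0-based half-open coordinate convention, and the toy sequence are all assumptions made for this example.

```python
from typing import List, Tuple


def intron_splice_sites(
    genome: str, exons: List[Tuple[int, int]]
) -> List[Tuple[str, str, bool]]:
    """Report donor/acceptor dinucleotides for each intron between exons.

    Assumptions (hypothetical, for illustration): exons are given as
    0-based, half-open (start, end) intervals on the forward strand,
    sorted by position, and every intron is at least 4 bases long.
    """
    results = []
    # Each intron is the gap between the end of one exon and the start
    # of the next.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genome[prev_end:next_start].upper()
        donor, acceptor = intron[:2], intron[-2:]
        canonical = donor == "GT" and acceptor == "AG"
        results.append((donor, acceptor, canonical))
    return results


# Toy example: two 6-base exons separated by a 12-base intron.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
exons = [(0, 6), (18, 24)]
print(intron_splice_sites(genome, exons))
```

A check of this kind only classifies borders that are already given; choosing the borders in the presence of sequencing errors or hits spread over several contigs is the harder problem that the tool addresses.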
Scipio is able to map a protein query precisely onto a genome. Even where there are many sequencing errors, or where incomplete genome assemblies lead to hits that stretch across multiple target sequences, it very often determines the intron-exon borders and splice sites correctly, showing improved prediction accuracy compared to BLAT and Exonerate. Apart from finding genes in the genome that encode the query protein, Scipio can also be used to annotate genes in closely related species.