Harper Courtney A, Huang Conrad C, Stryke Doug, Kawamoto Michiko, Ferrin Thomas E, Babbitt Patricia C
Department of Biopharmaceutical Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA 94143-2250, USA.
BMC Genomics. 2006 Sep 18;7:236. doi: 10.1186/1471-2164-7-236.
Gene knockouts in a model organism such as the mouse provide a valuable resource for the study of basic biology and human disease. Determining which gene has been inactivated by an untargeted gene trapping event poses a challenging annotation problem because gene trap sequence tags, which represent sequence near the vector insertion site of a trapped gene, are typically short and often contain unresolved residues. To better understand how these sequences localize to the mouse genome, we compared stand-alone versions of the alignment programs BLAT, SSAHA, and MegaBLAST. A set of 3,369 sequence tags was aligned to build 34 of the mouse genome using the default parameters of each program. Known genome coordinates for the cognate set of full-length genes (1,659 sequences) were used to evaluate the localization results.
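The evaluation described above, checking a tag's alignment against the known coordinates of its cognate gene, can be illustrated with a minimal sketch. The abstract does not state the exact criterion used, so the overlap rule, the optional `flank` padding, and the names `Interval`, `overlaps`, and `localized_correctly` are all assumptions made for illustration only.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval (hypothetical helper type, not from the paper)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def localized_correctly(tag_hit: Interval, gene_locus: Interval, flank: int = 0) -> bool:
    """Assumed criterion: the tag's alignment overlaps the known locus of its
    cognate full-length gene, optionally padded by `flank` bases on each side."""
    padded = Interval(gene_locus.chrom,
                      max(0, gene_locus.start - flank),
                      gene_locus.end + flank)
    return overlaps(tag_hit, padded)


# Illustrative coordinates only; these are not data from the study.
gene = Interval("chr7", 1_000_000, 1_050_000)
hit_inside = Interval("chr7", 1_010_000, 1_010_300)
hit_elsewhere = Interval("chr12", 1_010_000, 1_010_300)
print(localized_correctly(hit_inside, gene))
print(localized_correctly(hit_elsewhere, gene))
```

A hit on a different chromosome, such as an alignment to a pseudogene copy, fails this check even when its start and end positions happen to match numerically.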
In general, all three programs performed well in localizing sequences to a general region of the genome, with only relatively subtle errors identified for a small proportion of the sequence tags. However, performance differed greatly with regard to correctly identifying exon boundaries: BLAT correctly identified the vast majority of exon boundaries, whereas SSAHA and MegaBLAST missed most of them. SSAHA consistently reported the fewest false positives and was the fastest program. MegaBLAST was comparable to BLAT in speed but was the most susceptible to localizing sequence tags incorrectly to pseudogenes.
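One way to quantify how well an aligner recovers exon boundaries is the fraction of reference boundaries matched by a boundary of an aligned block. The following is a hedged sketch rather than the authors' method: the function name `boundary_recall`, the `tolerance` parameter, and the example coordinates are hypothetical.

```python
from typing import Iterable


def boundary_recall(reference: Iterable[int],
                    predicted: Iterable[int],
                    tolerance: int = 0) -> float:
    """Fraction of reference exon boundaries that have a predicted boundary
    within `tolerance` bases. Returns 0.0 when there are no reference boundaries."""
    ref = list(reference)
    pred = list(predicted)
    if not ref:
        return 0.0
    hits = sum(1 for r in ref if any(abs(r - p) <= tolerance for p in pred))
    return hits / len(ref)


# Illustrative positions only; these are not data from the study.
known_boundaries = [100, 200, 300]
aligned_block_edges = [100, 201, 500]
print(boundary_recall(known_boundaries, aligned_block_edges))
print(boundary_recall(known_boundaries, aligned_block_edges, tolerance=1))
```

With `tolerance=0` only exact matches count, so an alignment that runs one base past a splice site is scored as a miss. Raising the tolerance separates small boundary offsets from boundaries that were missed entirely.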
The differences in performance between sequence tags and full-length reference sequences were surprisingly small. Each program showed characteristic variations in its localization results, which particularly affected the localization of sequence at exon boundaries.