Wu Thomas D, Watanabe Colin K
Department of Bioinformatics, Genentech, Inc., South San Francisco, CA 94080, USA.
Bioinformatics. 2005 May 1;21(9):1859-75. doi: 10.1093/bioinformatics/bti310. Epub 2005 Feb 22.
We introduce GMAP, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich dynamic programming (DP) for splice site detection, and microexon identification with statistical significance testing.
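As a rough illustration of what oligomer chaining for approximate alignment can look like, the sketch below indexes genomic 8-mers, finds exact matches with the cDNA, and keeps the longest collinear chain of matches. This is a generic approach written for illustration, not GMAP's implementation; the function names, the choice of k = 8, the simple quadratic chaining, and the toy sequences are all assumptions.

```python
from collections import defaultdict


def kmer_hits(query, genome, k=8):
    """Return sorted (query_pos, genome_pos) pairs where an exact k-mer is shared."""
    index = defaultdict(list)
    for g in range(len(genome) - k + 1):
        index[genome[g:g + k]].append(g)
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            hits.append((q, g))
    return sorted(hits)


def chain_hits(hits):
    """Longest chain of hits that increases in both query and genome position.

    Simple O(n^2) dynamic program; illustrative only (hypothetical helper).
    """
    hits = sorted(hits)
    n = len(hits)
    if n == 0:
        return []
    score = [1] * n   # best chain length ending at each hit
    prev = [-1] * n   # back-pointer to the previous hit in that chain
    for i in range(n):
        qi, gi = hits[i]
        for j in range(i):
            qj, gj = hits[j]
            if qj < qi and gj < gi and score[j] + 1 > score[i]:
                score[i] = score[j] + 1
                prev[i] = j
    i = max(range(n), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = prev[i]
    return chain[::-1]


def genomic_gaps(chain):
    """Extra genomic bases between consecutive chained hits (candidate introns)."""
    gaps = []
    for (q1, g1), (q2, g2) in zip(chain, chain[1:]):
        extra = (g2 - g1) - (q2 - q1)
        if extra > 0:
            gaps.append((g1, g2, extra))
    return gaps


# Toy data, made up for illustration: two exons separated by one intron.
exon1 = "ATGGCCAAGTTCGACCTGAT"
exon2 = "CGCTACGTAGCTAGGATCCA"
intron = "GTAAGT" + "CT" * 7 + "T" * 9 + "CAG"
genome = "TTTTT" + exon1 + intron + exon2 + "AAAAA"
cdna = exon1 + exon2

chain = chain_hits(kmer_hits(cdna, genome))
print(chain[0], chain[-1], genomic_gaps(chain))
```

On this toy input, the jump in genomic position between consecutive chained hits marks a candidate intron. A real aligner would then refine the exon boundaries around that gap, which is the role the abstract assigns to the splice site detection step.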
On a set of human messenger RNAs with random mutations introduced at rates of 1% and 3%, GMAP identified all splice sites accurately in over 99.3% of the sequences, an error rate one-tenth that of existing programs. On a large set of human expressed sequence tags, GMAP provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, GMAP performed comparably with GeneSeqer. In these experiments, GMAP demonstrated a several-fold increase in speed over existing programs.
Source code for gmap and associated programs is available at http://www.gene.com/share/gmap