Haas Brian J, Delcher Arthur L, Mount Stephen M, Wortman Jennifer R, Smith Roger K, Hannick Linda I, Maiti Rama, Ronning Catherine M, Rusch Douglas B, Town Christopher D, Salzberg Steven L, White Owen
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
Nucleic Acids Res. 2003 Oct 1;31(19):5654-66. doi: 10.1093/nar/gkg770.
The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
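The abstract does not spell out the assembly procedure, so the following is only a minimal illustrative sketch of the general idea of merging compatible spliced alignments, not the PASA algorithm itself. It assumes each alignment is a sorted list of exon coordinates on a single strand, treats two alignments as compatible when they overlap and share identical introns within the overlapping region, and uses a simple greedy pass to build assemblies. The names `compatible`, `merge`, and `assemble` and the coordinate convention are hypothetical choices made for this example.

```python
from typing import List, Tuple

Exon = Tuple[int, int]      # assumed convention: inclusive (start, end) on the genome
Alignment = List[Exon]      # exons sorted by start, all on the same strand


def introns(aln: Alignment) -> List[Tuple[int, int]]:
    """Return the gaps between consecutive exons of one alignment."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(aln, aln[1:])]


def compatible(a: Alignment, b: Alignment) -> bool:
    """Assumed rule: the spans overlap and the introns that touch the
    overlapping region are identical in both alignments."""
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][1], b[-1][1])
    if lo > hi:                      # spans do not overlap at all
        return False

    def introns_in_overlap(aln: Alignment) -> set:
        return {(s, e) for s, e in introns(aln) if s <= hi and e >= lo}

    return introns_in_overlap(a) == introns_in_overlap(b)


def merge(a: Alignment, b: Alignment) -> Alignment:
    """Union the exons of two compatible alignments into one structure."""
    merged: Alignment = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def assemble(alignments: List[Alignment]) -> List[Alignment]:
    """Greedy simplification: add each alignment to the first assembly it
    is compatible with, otherwise start a new assembly."""
    assemblies: List[Alignment] = []
    for aln in sorted(alignments, key=lambda x: x[0][0]):
        for i, asm in enumerate(assemblies):
            if compatible(asm, aln):
                assemblies[i] = merge(asm, aln)
                break
        else:
            assemblies.append(list(aln))
    return assemblies
```

In this sketch, an alignment whose exon runs through another alignment's intron fails the compatibility test and ends up in a separate assembly, which is one simple way a splicing variation could surface as a distinct structure. A greedy pass does not guarantee maximal assemblies; it is used here only to keep the example short.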