Burset M, Seledtsov I A, Solovyev V V
Informatics Division, The Sanger Centre, Hinxton, Cambridge, CB10 1SA, UK.
Nucleic Acids Res. 2000 Nov 1;28(21):4364-75. doi: 10.1093/nar/28.21.4364.
A set of 43 337 splice junction pairs was extracted from mammalian GenBank annotated genes. Expressed sequence tag (EST) sequences support 22 489 of them. Of these, 98.71% contain the canonical dinucleotides GT and AG at the donor and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice site pairs; and the remaining 0.73% fall into many small groups (the largest accounting for 0.05%). Studying these groups, we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases, we present a new classification consisting of only eight observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of the splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea receives substantial support when we compare the sequences of human genes having non-canonical splice sites with the sequences deposited in GenBank by high throughput genome sequencing projects (HTG). A high proportion (156 out of 171) of the human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. After correction, they can be classified as: 79 GC-AG pairs (one of which was an error that corrected to GC-AG), 61 errors that corrected to canonical GT-AG pairs, six AT-AC pairs (two of which were errors that corrected to AT-AC), one case produced from a non-existent intron, seven cases found in HTG that were deposited to GenBank, and only two remaining cases of supported non-canonical splice sites. If we assume that approximately the same situation holds for the whole set of annotated mammalian non-canonical splice sites, then 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC, and only 0.02% could consist of other types of non-canonical splice sites.
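The extrapolated percentages can be reproduced with a short calculation. The sketch below is one possible reading, not necessarily the authors' exact procedure: it assumes that the observed non-canonical fraction (0.56% + 0.73% = 1.29%) is redistributed in proportion to the 148 HTG-matched cases that resolve to a definite splice site type (79 + 61 + 6 + 2), leaving out the non-existent intron and the seven HTG-deposited cases. The function and variable names are illustrative only.

```python
# Observed percentages among the 22 489 EST-supported pairs (from the abstract).
CANONICAL_OBSERVED = 98.71
NONCANONICAL_OBSERVED = 0.56 + 0.73  # GC-AG plus all small groups

# HTG-matched human cases that resolve to a definite type (from the abstract).
# Assumption: the 1 non-existent intron and 7 HTG-deposited cases are excluded.
RESOLVED_CASES = {"GC-AG": 79, "GT-AG": 61, "AT-AC": 6, "other": 2}


def extrapolate(canonical, noncanonical, resolved):
    """Redistribute the non-canonical share in proportion to resolved cases."""
    total = sum(resolved.values())
    shares = {kind: noncanonical * n / total for kind, n in resolved.items()}
    # Cases that corrected to GT-AG are added back to the canonical share.
    shares["GT-AG"] += canonical
    return shares


estimate = extrapolate(CANONICAL_OBSERVED, NONCANONICAL_OBSERVED, RESOLVED_CASES)
for kind in ("GT-AG", "GC-AG", "AT-AC", "other"):
    print(f"{kind}: {estimate[kind]:.2f}%")
```

Under this assumption the rounded values agree with the 99.24%, 0.69%, 0.05% and 0.02% quoted above.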
We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22 199 entries versus approximately 600) and finally, a set of 290 EST-supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.
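As an illustration of what a splice site weight matrix looks like, the following is a minimal sketch of a log-odds position weight matrix built from aligned sites. The abstract does not specify how the matrices were constructed, so the pseudocount, the uniform background frequency, the function names, and the toy donor sequences are all assumptions made for this example rather than data or methods from the paper.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2 odds} dict per aligned position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        }
        matrix.append(column)
    return matrix


def score(matrix, seq):
    """Sum the per-position log-odds for a candidate site."""
    return sum(column[base] for column, base in zip(matrix, seq))


# Hypothetical aligned donor sites (2 exonic bases, then 6 intronic bases).
toy_donors = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
pwm = build_weight_matrix(toy_donors)

print(score(pwm, "AGGTAAGT"))  # consensus-like site, scores high
print(score(pwm, "TTTTTTTT"))  # unrelated sequence, scores low
```

A gene prediction program could apply such a score at each candidate GT (or GC, AT) position and keep sites above a chosen threshold; separate matrices would be needed for each major splice site group.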