Rätsch G, Sonnenburg S, Schölkopf B
Friedrich Miescher Laboratory of the Max Planck Society, Spemannstrasse 35, Tübingen, Germany.
Bioinformatics. 2005 Jun;21 Suppl 1:i369-77. doi: 10.1093/bioinformatics/bti1053.
Eukaryotic pre-mRNAs are spliced to form mature mRNA. Alternative splicing of pre-mRNA greatly increases the complexity of gene expression. Estimates show that more than half of human genes and at least one-third of the genes of less complex organisms, such as nematodes or flies, are alternatively spliced. In this work, we consider one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons. Although most recent computational studies on alternative splicing apply only to exons that are conserved between two species, our method uses only information that is available to the splicing machinery, i.e. the DNA sequence itself. We employ advanced machine learning techniques to answer the following two questions: (1) Is a certain exon alternatively spliced? (2) How can we identify as yet unidentified exons within known introns?
We designed a support vector machine (SVM) kernel well suited to classifying sequences that contain motifs with positional preferences. To answer question (1), we combine the kernel with additional local sequence information, such as the lengths of the exon and of the flanking introns. The resulting SVM-based classifier achieves a true positive rate of 48.5% at a false positive rate of 1%. By scanning exons confirmed by only a single EST, we identified 215 potential alternatively spliced exons. For 10 randomly selected exons from this set, we successfully performed biological verification experiments and confirmed three novel alternatively spliced exons. To answer question (2), we additionally used SVM-based predictions to recognize acceptor and donor splice sites. Combined with the above-mentioned features, we were able to identify 85.2% of skipped exons within known introns at a false positive rate of 1%.
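The abstract does not give the form of the kernel. As a rough illustration of what a kernel for motifs with positional preferences can look like, the sketch below implements a plain weighted degree kernel, which counts k-mers that match at the same position in two equal-length sequences. This is an assumed stand-in and not necessarily the kernel used in this work; the function name `weighted_degree_kernel`, the parameter `max_degree`, and the example sequences are hypothetical.

```python
def weighted_degree_kernel(x: str, y: str, max_degree: int = 3) -> float:
    """Position-dependent k-mer kernel (illustrative assumption, not the
    authors' exact kernel).

    For each substring length d = 1..max_degree, count the positions i at
    which x[i:i+d] equals y[i:i+d], and weight that count by
    beta_d = 2 * (D - d + 1) / (D * (D + 1)), where D = max_degree.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    n = len(x)
    total = 0.0
    for d in range(1, max_degree + 1):
        # Shorter k-mers get larger weights; the weights sum to 1 over d.
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        # A k-mer only counts when it occurs at the same position in both.
        matches = sum(1 for i in range(n - d + 1) if x[i:i + d] == y[i:i + d])
        total += beta * matches
    return total


if __name__ == "__main__":
    # Identical sequences share every k-mer at every position.
    print(weighted_degree_kernel("ACGT", "ACGT", max_degree=2))
    # Same motif shifted by one position: no position-wise matches.
    print(weighted_degree_kernel("ACGT", "CGTA", max_degree=2))
```

Because matches are only counted at identical positions, a motif shifted by even one base contributes nothing in this sketch; a kernel meant to tolerate small positional variation would need an explicit shift term. A precomputed kernel matrix built from such a function could then be combined with length features in an SVM, as the text describes.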
Datasets, model selection results, our predictions and additional experimental results are available at http://www.fml.tuebingen.mpg.de/~raetsch/RASE.
Supplementary information: http://www.fml.tuebingen.mpg.de/raetsch/RASE.