Sinha Rileen, Hiller Michael, Pudimat Rainer, Gausmann Ulrike, Platzer Matthias, Backofen Rolf
Genome Analysis, Leibniz Institute for Age Research, Fritz Lipmann Institute, Jena, Germany.
BMC Bioinformatics. 2008 Nov 12;9:477. doi: 10.1186/1471-2105-9-477.
Alternative splicing is a major contributor to the diversity of eukaryotic transcriptomes and proteomes. Currently, large-scale detection of alternative splicing using expressed sequence tags (ESTs) or microarrays does not capture all alternative splicing events. Moreover, for many species genomic data is being produced at a far greater rate than corresponding transcript data, hence in silico methods for predicting alternative splicing have to be improved.
Here, we show that the use of Bayesian networks (BNs) allows accurate prediction of evolutionarily conserved exon skipping events. At a stringent false positive rate of 0.5%, our BN achieves an improved true positive rate of 61%, compared to a previously reported 50% on the same dataset using support vector machines (SVMs). The improvement comes from incorporating several novel discriminative features, such as intronic splicing regulatory elements. Features related to mRNA secondary structure increase the prediction performance, corroborating previous findings that secondary structures are important for exon recognition. Random labelling tests rule out overfitting. Cross-validation on another dataset confirms the increased performance. When using the same dataset and the same set of features, the BN matches the performance of an SVM reported in earlier literature. Remarkably, we show that about half of the exons that are labelled constitutive but are assigned a high probability of being alternative by the BN are in fact alternative exons according to the latest EST data. Finally, we predict exon skipping without using conservation-based features, and achieve a true positive rate of 29% at a false positive rate of 0.5%.
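The performance figures above are true positive rates measured at a fixed false positive rate of 0.5%. As a minimal sketch of how such a figure could be computed from classifier output, the Python snippet below lowers the decision threshold over ranked scores until the false positive budget is exceeded. The function name `tpr_at_fpr`, the label encoding (1 = alternative, 0 = constitutive) and the toy scores are assumptions made for illustration; this is not the authors' evaluation code.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.005):
    """Highest true positive rate reachable while FPR stays <= max_fpr.

    scores: classifier outputs, higher means "more likely alternative".
    labels: 1 for alternative exons, 0 for constitutive exons (assumed encoding).
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one exon of each class")

    # Rank exons from highest to lowest score, then lower the threshold
    # one distinct score value at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Exons with tied scores fall on the same side of any threshold,
        # so they are counted together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / negatives > max_fpr:
            break  # this threshold admits too many false positives
        best_tpr = tp / positives
        i = j
    return best_tpr


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
strict_tpr = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
```

With a real dataset, `max_fpr=0.005` would correspond to the 0.5% operating point quoted above. Any probabilistic classifier, BN or SVM, could supply the scores.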
BNs can be used to achieve accurate identification of alternative exons and provide clues about possible dependencies between relevant features. The near-identical performance of the BN and SVM when using the same features shows that good classification depends more on features than on the choice of classifier. Conservation-based features continue to be the most informative, and hence distinguishing alternative exons from constitutive ones without using conservation-based features remains a challenging problem.