Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada.
Bioinformatics. 2011 Sep 15;27(18):2554-62. doi: 10.1093/bioinformatics/btr444. Epub 2011 Jul 29.
Alternative splicing is a major contributor to cellular diversity in mammalian tissues and relates to many human diseases. An important goal in understanding this phenomenon is to infer a 'splicing code' that predicts how splicing is regulated in different cell types by features derived from RNA, DNA and epigenetic modifiers.
We formulate the assembly of a splicing code as a problem of statistical inference and introduce a Bayesian method that uses an adaptively selected number of hidden variables to combine subgroups of features into a network, allows different tissues to share feature subgroups and uses a Gibbs sampler to hedge predictions and ascertain the statistical significance of identified features.
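To make the role of the Gibbs sampler concrete, the following is a minimal sketch on a toy problem. It is not the model described in this abstract: it has no hidden variables and no sharing across tissues. It assumes a plain linear regression with a spike-and-slab prior on each feature, and the function names, the hyperparameters (`sigma2`, `tau2`, `prior_incl`) and the synthetic data are all hypothetical. It shows how the fraction of posterior samples that include a feature can serve as a measure of that feature's significance.

```python
import math
import random


def make_toy_data(n=200, seed=1):
    """Synthetic data (hypothetical): only the first two features affect y."""
    rng = random.Random(seed)
    true_beta = [2.0, -1.5, 0.0, 0.0, 0.0]
    X = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 1)
         for row in X]
    return X, y


def gibbs_feature_selection(X, y, n_sweeps=300, burn_in=100,
                            sigma2=1.0, tau2=1.0, prior_incl=0.2, seed=0):
    """Return the posterior inclusion probability of each feature column.

    Assumed model: y = sum_j beta_j * x_j + noise, noise ~ N(0, sigma2),
    beta_j = 0 with probability 1 - prior_incl, else beta_j ~ N(0, tau2).
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    beta = [0.0] * p          # a coefficient is 0 while its feature is excluded
    resid = list(y)           # y minus the current fit (all betas start at 0)
    incl_counts = [0] * p
    xtx = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    prior_log_odds = math.log(prior_incl / (1.0 - prior_incl))

    for sweep in range(n_sweeps):
        for j in range(p):
            # Add feature j back so resid reflects only the other features.
            if beta[j] != 0.0:
                for i in range(n):
                    resid[i] += beta[j] * X[i][j]
            # Conditional posterior of beta_j if included: N(mean, 1 / prec).
            prec = xtx[j] / sigma2 + 1.0 / tau2
            mean = sum(X[i][j] * resid[i] for i in range(n)) / sigma2 / prec
            # Log odds of including feature j, with beta_j integrated out.
            log_odds = (prior_log_odds - 0.5 * math.log(tau2 * prec)
                        + 0.5 * mean * mean * prec)
            if log_odds >= 0:
                p_incl = 1.0 / (1.0 + math.exp(-log_odds))
            else:
                e = math.exp(log_odds)
                p_incl = e / (1.0 + e)
            if rng.random() < p_incl:
                beta[j] = rng.gauss(mean, 1.0 / math.sqrt(prec))
                for i in range(n):
                    resid[i] -= beta[j] * X[i][j]
                if sweep >= burn_in:
                    incl_counts[j] += 1
            else:
                beta[j] = 0.0
    return [c / (n_sweeps - burn_in) for c in incl_counts]


if __name__ == "__main__":
    X, y = make_toy_data()
    for j, prob in enumerate(gibbs_feature_selection(X, y)):
        print(f"feature {j}: inclusion probability {prob:.2f}")
```

Averaging predictions over the retained samples, rather than using a single fitted model, is one way a sampler of this kind can hedge its predictions.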
Using data for 3665 cassette exons, 1014 RNA features and 4 tissue types derived from 27 mouse tissues (http://genes.toronto.edu/wasp), we benchmarked several methods. Our method outperforms all others, and achieves relative improvements of 52% in splicing code quality and up to 22% in classification error, compared with the state of the art. Novel combinations of regulatory features and novel combinations of tissues that share feature subgroups were identified using our method.
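The improvements above are relative, not absolute. As a small illustration of the arithmetic, the sketch below computes a relative improvement from a baseline score and a new score. The function name and the example error rates are hypothetical and are not figures from the paper.

```python
def relative_improvement(baseline, candidate, lower_is_better=False):
    """Fractional improvement of `candidate` over `baseline`.

    Set lower_is_better=True for quantities such as classification error,
    where a smaller value is the better one.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    gain = (baseline - candidate) if lower_is_better else (candidate - baseline)
    return gain / abs(baseline)


if __name__ == "__main__":
    # Hypothetical error rates, chosen only to illustrate the calculation.
    baseline_error, new_error = 0.100, 0.078
    rel = relative_improvement(baseline_error, new_error, lower_is_better=True)
    print(f"relative reduction in error: {rel:.0%}")
```

In this made-up case the error falls by only 2.2 percentage points, yet the relative reduction is about 22%, which is why the two kinds of figure should not be confused.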
Supplementary data are available at Bioinformatics online.