Department of Computer and Information Science, School of Engineering, University of Pennsylvania, Philadelphia, PA, USA.
Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
Bioinformatics. 2017 Jul 15;33(14):i274-i282. doi: 10.1093/bioinformatics/btx268.
Advancements in sequencing technologies have highlighted the role of alternative splicing (AS) in increasing transcriptome complexity. This role of AS, combined with the relation of aberrant splicing to malignant states, motivated two streams of research, experimental and computational. The first involves a myriad of techniques such as RNA-Seq and CLIP-Seq to identify splicing regulators and their putative targets. The second involves probabilistic models, also known as splicing codes, which infer regulatory mechanisms and predict splicing outcome directly from genomic sequence. To date, these models have utilized only expression data. In this work, we address two related challenges: can we improve on previous models for AS outcome prediction, and can we integrate additional sources of data to improve predictions for AS regulatory factors?
We perform a detailed comparison of two previous modeling approaches, Bayesian and deep neural networks, dissecting the confounding effects of datasets and target functions. We then develop a new target function for AS prediction in exon skipping events and show it significantly improves model accuracy. Next, we develop a modeling framework that leverages transfer learning to incorporate CLIP-Seq, knockdown and overexpression experiments, which are inherently noisy and suffer from missing values. Using several datasets involving key splice factors in mouse brain, muscle and heart, we demonstrate both the prediction improvements and biological insights offered by our new models. Overall, the framework we propose offers a scalable integrative solution to improve splicing code modeling as vast amounts of relevant genomic data become available.
Code and data are available at majiq.biociphers.org/jha_et_al_2017/.
Supplementary data are available at Bioinformatics online.