Barash Yoseph, Garcia Jorge Vaquero
Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
Methods Mol Biol. 2014;1126:411-23. doi: 10.1007/978-1-62703-980-2_28.
Alternative splicing of pre-mRNA is a complex process whose outcome depends on elements reviewed in the previous chapters, such as the core spliceosome units, how those units interact among themselves and with other splicing enhancers and repressors, primary sequence motifs, and local RNA secondary structure. Connections between RNA splicing, transcription, and other processes have also been reviewed in the previous chapters. Splicing is inherently a stochastic process: some defective transcripts are produced and handled by mechanisms such as nonsense-mediated decay (NMD), and studies report high variability at the transcript level between cells supposedly in similar states. Nonetheless, splicing is clearly not a random process: many determinants of splicing regulation have been identified, and experimental measurements detect highly robust and conserved splicing changes between developmental stages and tissues. These observations naturally lead to the following questions: Can we devise a method that, given a cellular context and the primary transcript, predicts the splicing outcome? What can such a method tell us about the underlying mechanisms that govern alternative splicing? This chapter describes how these questions can be framed and addressed using machine-learning methodology. We describe how to extract putative RNA regulatory features from the genomic sequence of exons and proximal introns, how to define target values based on experimental measurements of exon inclusion, how to learn a simple splicing model that optimizes the prediction of the observed exon inclusion levels from the identified RNA features, and how to subsequently evaluate the accuracy of the learned model.
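The four steps named above (feature extraction, target definition, model learning, evaluation) can be illustrated with a minimal sketch. It is not the chapter's actual method: it assumes dinucleotide frequencies as stand-ins for putative regulatory features, a percent-spliced-in (PSI) value computed from inclusion and exclusion read counts as the target, and a logistic model fitted by plain gradient descent. All sequences and read counts are invented toy values, each sequence is a single string standing in for an exon plus its flanking intronic sequence, and every function name is hypothetical.

```python
import math
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # the 16 dinucleotides

def kmer_features(seq):
    """Dinucleotide frequencies of seq (assumed stand-in for regulatory features)."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    windows = max(len(seq) - 1, 1)
    return [counts[k] / windows for k in KMERS]

def psi(inclusion_reads, exclusion_reads):
    """Target value: fraction of reads that support exon inclusion."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("exon has no supporting reads")
    return inclusion_reads / total

def predict(weights, bias, x):
    """Predicted inclusion level in (0, 1) from a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(weights, bias, X, y):
    """Mean cross-entropy between observed and predicted inclusion levels."""
    total = 0.0
    for x, t in zip(X, y):
        p = predict(weights, bias, x)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the cross-entropy loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        bias -= lr * sum(errors) / n
        weights = [w - lr * sum(e * x[j] for e, x in zip(errors, X)) / n
                   for j, w in enumerate(weights)]
    return weights, bias

# Toy data (invented): (sequence, inclusion reads, exclusion reads).
TRAIN = [("GGGAGGGTGGGAGGGCGGGA", 45, 5), ("GGAGGGTGGGCGGGAGGGTG", 40, 10),
         ("TTTCTTTATTTCTTTATTTC", 5, 45), ("TTCTTTATTTCTTTATTTCT", 10, 40)]
HELD_OUT = [("GGGTGGGAGGGAGGGCGGGT", 42, 8), ("TTTATTTCTTTCTTTATTTA", 8, 42)]

train_X = [kmer_features(s) for s, _, _ in TRAIN]
train_y = [psi(inc, exc) for _, inc, exc in TRAIN]
weights, bias = fit_logistic(train_X, train_y)

loss_before = cross_entropy([0.0] * len(KMERS), 0.0, train_X, train_y)
loss_after = cross_entropy(weights, bias, train_X, train_y)

# Evaluate on exons the model never saw during training.
held_y = [psi(inc, exc) for _, inc, exc in HELD_OUT]
held_pred = [predict(weights, bias, kmer_features(s)) for s, _, _ in HELD_OUT]
held_mae = sum(abs(p - t) for p, t in zip(held_pred, held_y)) / len(held_y)
print(f"training loss {loss_before:.3f} -> {loss_after:.3f}, held-out MAE {held_mae:.3f}")
```

Evaluating on held-out exons rather than on the training set is the key design choice here, since a model can fit its training data well and still fail to generalize. A realistic setup would likely compute features separately for the exon and each flanking intron and add condition-specific targets, which this sketch omits.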