Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada; Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada; Canadian Institute for Advanced Research, MaRS Centre, West Tower, 661 University Avenue, Suite 505, Toronto, ON M5G 1M1, Canada.
Curr Opin Struct Biol. 2018 Dec;53:115-123. doi: 10.1016/j.sbi.2018.08.001. Epub 2018 Aug 29.
Identifying the binding preferences of RNA-binding proteins (RBPs) is important in understanding their contribution to post-transcriptional regulation. Here, we review the current state of the art of RNA motif identification tools for RBPs. New in vivo and in vitro data sets provide sufficient statistical power to enable detection of relatively long and complex sequence and sequence-structure binding preferences, and recent computational methods are geared towards quantitative identification of these patterns. We classify methods by their motif model's representational power and describe the underlying considerations for RNA-protein interactions. All classical motif identification algorithms apply physically motivated architectures consisting of a motif model and an occupancy model; we call these explicit motif models. Recent methods, such as convolutional neural networks and support vector machines, abandon the classical architecture and implicitly model RNA binding without defining a motif model. Although they achieve high accuracy on held-out data, they may be ill-suited to the ultimate goal of the field: using motifs trained on in vitro data to predict in vivo binding sites. For this task, methods need to separate intrinsic binding preferences from cellular effects such as protein and RNA concentrations, cooperativity, and competition. To tackle this problem, we advocate for the use of a 'three-layer' architecture, consisting of a motif model, an occupancy model, and an extrinsic factor model, which enables separation of intrinsic from cellular effects and adjustment to cellular conditions.
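As a rough illustration of the explicit, layered architecture described above, the following minimal Python sketch separates a motif model, an occupancy model, and an extrinsic factor. It is not taken from any of the reviewed tools. The energy matrix, the function names, the independent-site assumption, and the use of a single log-concentration term as the only extrinsic factor are all assumptions made here for illustration.

```python
import math

# Hypothetical toy motif of width 3 preferring "UGU"; energies are in
# arbitrary kT-like units (0 = preferred base, 2 = penalty). Illustrative only.
TOY_MATRIX = [
    {"A": 2.0, "C": 2.0, "G": 2.0, "U": 0.0},
    {"A": 2.0, "C": 2.0, "G": 0.0, "U": 2.0},
    {"A": 2.0, "C": 2.0, "G": 2.0, "U": 0.0},
]


def site_energies(seq, energy_matrix):
    """Layer 1 (motif model): additive binding energy of every window in seq."""
    width = len(energy_matrix)
    return [
        sum(energy_matrix[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


def occupancy(seq, energy_matrix, log_conc):
    """Layer 2 (occupancy model): probability that at least one site is bound.

    Each site is treated as an independent two-state system (an assumption);
    log_conc is the layer-3 extrinsic factor, standing in for the effective
    protein concentration of a given condition.
    """
    p_all_unbound = 1.0
    for energy in site_energies(seq, energy_matrix):
        p_site = 1.0 / (1.0 + math.exp(energy - log_conc))
        p_all_unbound *= 1.0 - p_site
    return 1.0 - p_all_unbound


if __name__ == "__main__":
    # The same intrinsic motif evaluated under two hypothetical conditions.
    for label, log_conc in [("low concentration", -2.0), ("high concentration", 2.0)]:
        print(label, round(occupancy("AUGUA", TOY_MATRIX, log_conc), 3))
```

In this sketch the motif parameters stay fixed while only `log_conc` changes between conditions, which is the kind of separation between intrinsic preferences and cellular effects that the three-layer architecture is meant to enable. Cooperativity, competition, and RNA structure are deliberately left out.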