Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China.
PLoS Comput Biol. 2019 Sep 4;15(9):e1007283. doi: 10.1371/journal.pcbi.1007283. eCollection 2019 Sep.
Predicting RNA-binding protein (RBP) specificity is important for understanding gene expression regulation and RNA-mediated enzymatic processes. It is widely believed that RBP binding specificity is determined by both the sequence and the structural context of RNAs. Existing approaches, including traditional machine learning algorithms and, more recently, deep learning models, have been extensively applied to integrate RNA sequence with predicted or experimentally measured RNA structural probabilities to improve the accuracy of RBP binding prediction. Such models have mostly been trained on large-scale in vitro datasets, such as the RNAcompete dataset. However, most synthetic RNAs in RNAcompete are unstructured, which prevents machine learning methods from effectively extracting the structural preferences of RBP binding. Furthermore, both theoretical and experimental evidence indicate that RNA structure may be variable or multi-modal. In this work, we propose ThermoNet, a thermodynamic prediction model that integrates a new sequence-embedding convolutional neural network over a thermodynamic ensemble of RNA secondary structures. First, the sequence-embedding convolutional neural network generalizes existing k-mer based methods by jointly learning convolutional filters and k-mer embeddings to represent RNA sequence contexts. Second, the thermodynamic average of the deep-learning predictions accounts for structural variability and improves prediction, especially for structured RNAs. Extensive experiments demonstrate that our method significantly outperforms existing approaches, including RCK, DeepBind and several other recent state-of-the-art methods, on both in vitro and in vivo data. The implementation of ThermoNet is available at https://github.com/suyufeng/ThermoNet.
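The idea of a thermodynamic average of predictions can be illustrated with a minimal sketch. It is not taken from the ThermoNet repository: the function names, the Boltzmann weighting by free energy, and the example energies and scores are all assumptions made for illustration. The sketch supposes that each candidate secondary structure has a free energy in kcal/mol and a per-structure binding score from some model, and combines the scores using normalized Boltzmann weights.

```python
import math

# Gas constant in kcal/(mol*K) times 310.15 K (37 degrees C); an assumed default.
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def boltzmann_weights(free_energies, rt=RT_KCAL_PER_MOL):
    """Return normalized Boltzmann weights for structures with the given
    free energies (kcal/mol). Hypothetical helper, not ThermoNet's API."""
    if not free_energies:
        raise ValueError("at least one structure is required")
    # Shift by the minimum energy for numerical stability; the shift
    # cancels when the weights are normalized.
    e_min = min(free_energies)
    unnormalized = [math.exp(-(e - e_min) / rt) for e in free_energies]
    partition = sum(unnormalized)
    return [u / partition for u in unnormalized]


def thermodynamic_average(predictions, free_energies, rt=RT_KCAL_PER_MOL):
    """Boltzmann-weighted average of per-structure predictions."""
    if len(predictions) != len(free_energies):
        raise ValueError("one prediction per structure is required")
    weights = boltzmann_weights(free_energies, rt)
    return sum(w * p for w, p in zip(weights, predictions))


if __name__ == "__main__":
    # Made-up values: three candidate structures of one RNA.
    energies = [-4.2, -3.9, -1.0]   # kcal/mol (illustrative only)
    scores = [0.81, 0.35, 0.10]     # per-structure model outputs (illustrative only)
    print(thermodynamic_average(scores, energies))
```

If structures are instead sampled from the ensemble in proportion to their Boltzmann probability, a plain mean of the per-sample predictions estimates the same quantity; which of the two the authors use is not stated in the abstract.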