Orenstein Yaron, Wang Yuhao, Berger Bonnie
Computer Science and Artificial Intelligence Laboratory.
Computer Science and Artificial Intelligence Laboratory and Math Department, MIT, Cambridge, MA, USA.
Bioinformatics. 2016 Jun 15;32(12):i351-i359. doi: 10.1093/bioinformatics/btw259.
Protein-RNA interactions, which play vital roles in many processes, are mediated through both RNA sequence and structure. CLIP-based methods, which measure protein-RNA binding in vivo, suffer from experimental noise and systematic biases, whereas in vitro experiments capture a clearer signal of protein-RNA binding. Among the in vitro methods, RNAcompete provides binding affinities of a specific protein to more than 240 000 unstructured RNA probes in one experiment. The computational challenge is to infer RNA structure- and sequence-based binding models from these data. The state-of-the-art sequence model, Deepbind, does not model structural preferences. RNAcontext models both sequence and structure preferences, but is outperformed by GraphProt. Unfortunately, as its developers note, GraphProt cannot detect structural preferences from RNAcompete data because the probes are unstructured, nor can it be tractably run on the full RNAcompete dataset.
We develop RCK, an efficient, scalable algorithm that infers both sequence and structure preferences based on a new k-mer-based model. Remarkably, even though RNAcompete data is designed to be unstructured, RCK can still learn structural preferences from it. RCK significantly outperforms both RNAcontext and Deepbind in in vitro binding prediction for 244 RNAcompete experiments. Moreover, RCK is faster and uses less memory, which makes it scalable. Although RCK is currently on par with existing methods in in vivo binding prediction on a small-scale test, we demonstrate that it will increasingly benefit from experimentally measured RNA structure profiles as compared to computationally predicted ones. By running RCK on the entire RNAcompete dataset, we generate and provide as a resource a set of protein-RNA structure-based models on an unprecedented scale.
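To make the idea of a k-mer-based sequence-and-structure model more concrete, the sketch below pairs every k-mer in a probe with the structural annotation of the same positions and averages the measured intensity over the probes that contain each pair. This is only an illustrative enrichment statistic, not RCK's learned model. The function name `kmer_context_scores`, the two-letter context alphabet (`P` for paired, `U` for unpaired), and the toy probes and intensities are all assumptions made for this example.

```python
from collections import defaultdict


def kmer_context_scores(probes, k=3):
    """Mean intensity per (k-mer, structural context) pair.

    `probes` is an iterable of (sequence, context, intensity) tuples, where
    `context` is a string of the same length as `sequence` that labels each
    nucleotide with a structural state (hypothetical alphabet: P/U).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, ctx, intensity in probes:
        if len(seq) != len(ctx):
            raise ValueError("sequence and context must have equal length")
        # Count each (k-mer, context) pair at most once per probe.
        seen = {(seq[i:i + k], ctx[i:i + k]) for i in range(len(seq) - k + 1)}
        for key in seen:
            totals[key] += intensity
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy data, invented purely for illustration.
toy_probes = [
    ("ACGUA", "UUUPP", 2.0),
    ("ACGCC", "UUUUU", 4.0),
    ("GGACG", "PPUUU", 1.0),
]
scores = kmer_context_scores(toy_probes, k=3)
for (kmer, context), score in sorted(scores.items()):
    print(kmer, context, round(score, 3))
```

A real model would fit parameters to the full probe set and use predicted structure probabilities rather than a single hard annotation per position; the sketch only shows how sequence and structure can be indexed jointly at the k-mer level.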
Software and models are freely available at http://rck.csail.mit.edu/
Supplementary data are available at Bioinformatics online.