Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA.
Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA.
Nucleic Acids Res. 2019 Mar 18;47(5):e26. doi: 10.1093/nar/gky1294.
Identifying the binding targets of RNA-binding proteins (RBPs) can greatly facilitate our understanding of their functional mechanisms. Most computational methods employ machine learning to train classifiers on either RBP-specific targets or pooled RBP-RNA interactions. The former strategy is more powerful, but it applies only to the few RBPs with a large number of known targets; conversely, the latter strategy sacrifices prediction accuracy for wider applicability, since specific interaction features are inevitably obscured when heterogeneous datasets are pooled. Here, we present beRBP, a dual approach that predicts human RBP-RNA interactions given the position weight matrix (PWM) of an RBP and an RNA sequence. Based on Random Forests, beRBP not only builds a specific model for each RBP with a sufficient number of known targets, but also develops a general model for RBPs with few or no known targets. Both the specific and the general models compared favorably with existing methods on three benchmark datasets. Notably, the general model outperformed existing methods on most novel RBPs. Overall, as a composite solution spanning the RBP-specific and RBP-general strategies, beRBP is a promising tool for estimating human RBP binding, with good prediction accuracy and a broad scope of application.
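The abstract names a PWM and an RNA sequence as the inputs but does not describe how beRBP turns them into features. As an illustration only, the following minimal Python sketch shows one common way a PWM can be scanned along an RNA sequence to find the best-matching window. The function name `best_pwm_match`, the uniform background of 0.25, the pseudocount, and the toy motif are all assumptions made for this example and are not taken from beRBP.

```python
import math


def best_pwm_match(pwm, rna, background=0.25, pseudocount=1e-3):
    """Return (score, start) of the highest-scoring window of `rna`.

    `pwm` is a list of dicts, one per motif position, mapping each base
    to its probability. The score is a log2-odds sum against a uniform
    background. All parameter defaults are assumptions for illustration.
    """
    width = len(pwm)
    if len(rna) < width:
        raise ValueError("sequence is shorter than the motif")

    best_score, best_start = -math.inf, -1
    for start in range(len(rna) - width + 1):
        score = 0.0
        for offset, base in enumerate(rna[start:start + width]):
            # The pseudocount avoids log(0) for bases absent from the PWM.
            prob = pwm[offset].get(base, 0.0) + pseudocount
            score += math.log2(prob / background)
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Hypothetical 4-position motif that favours "UGCA" (not a real RBP motif).
toy_pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "U": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "U": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "U": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "U": 0.05},
]

score, start = best_pwm_match(toy_pwm, "AAUGCAUU")
print(f"best window starts at {start} with log-odds score {score:.2f}")
```

In a Random Forest setting, a score of this kind could serve as one input feature among several; which features beRBP actually uses is not stated in the abstract.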