School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China.
School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China.
Int J Mol Sci. 2020 Sep 19;21(18):6879. doi: 10.3390/ijms21186879.
With close to 30 sequence-based predictors of RNA-binding residues (RBRs) available, this comparative survey aims to help users understand and select appropriate tools. We discuss past reviews on this topic, survey a comprehensive collection of predictors, and comparatively assess six representative methods. We provide a novel and well-designed benchmark dataset, and we are the first to report and compare protein-level and dataset-level results and to contextualize performance to specific types of RNAs. The methods considered here are well-cited and rely on machine learning algorithms, occasionally combined with homology-based prediction. Empirical tests reveal that they provide relatively accurate predictions. Virtually all methods perform well for the proteins that interact with rRNAs, some generate accurate predictions for mRNAs, snRNA, SRP and IRES, while proteins that bind tRNAs are predicted poorly. Moreover, except for DRNApred, they confuse DNA-binding and RNA-binding residues. None of the six methods consistently outperforms the others when tested on individual proteins. This variable and complementary protein-level performance suggests that users should not rely on just the single best dataset-level predictor. We recommend that future work focus on the development of approaches that facilitate protein-level selection of accurate predictors and the consensus-based prediction of RBRs.
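The distinction between dataset-level and protein-level assessment, and the idea of consensus-based prediction, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the evaluation metric (the Matthews correlation coefficient is used here as a common choice for binary residue labels), the function names are hypothetical, the labels and predictions are toy values rather than results from the survey, and a simple majority vote stands in for whatever consensus scheme future work might adopt.

```python
def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) residue labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # Convention: return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def dataset_level(labels, preds):
    """Pool the residues of all proteins and compute a single score."""
    all_true = [y for pid in labels for y in labels[pid]]
    all_pred = [y for pid in labels for y in preds[pid]]
    return mcc(all_true, all_pred)


def protein_level(labels, preds):
    """Compute one score per protein."""
    return {pid: mcc(labels[pid], preds[pid]) for pid in labels}


def majority_vote(pred_list):
    """Per-residue strict majority vote over equal-length binary predictions."""
    n = len(pred_list)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*pred_list)]


# Toy data (hypothetical): 1 = RNA-binding residue, 0 = non-binding residue.
labels = {"P1": [1, 1, 0, 0], "P2": [0, 1, 0, 0]}
method_a = {"P1": [1, 1, 0, 0], "P2": [1, 0, 0, 0]}
method_b = {"P1": [1, 0, 1, 0], "P2": [0, 1, 0, 0]}

for name, preds in [("A", method_a), ("B", method_b)]:
    print(name, dataset_level(labels, preds), protein_level(labels, preds))
```

In this toy setting the two hypothetical methods obtain the same dataset-level score, yet each is the better choice for a different protein, which is the kind of complementary protein-level behaviour that motivates protein-level predictor selection and consensus prediction.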