Department of Computer Science and Engineering, Inha University, Incheon, South Korea.
Comput Methods Programs Biomed. 2015 Jun;120(1):3-15. doi: 10.1016/j.cmpb.2015.03.010. Epub 2015 Apr 8.
In recent years, several computational methods have been developed to predict RNA-binding sites in proteins. Most of these methods do not consider the interacting partners of a protein, so they predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNAs. By contrast, the problem of predicting protein-binding sites in RNA has received little attention, mainly because it is much more difficult and yields lower accuracy on average. In our previous study, we developed a method that predicts protein-binding nucleotides from an RNA sequence. To improve the prediction accuracy and usefulness of that method, we developed a new method that uses both RNA and protein sequence data. In this study, we identified effective features of RNA and protein molecules and developed a new support vector machine (SVM) model to predict protein-binding nucleotides from RNA and protein sequence data. The new model, which used both protein and RNA sequence data, achieved a sensitivity of 86.5%, a specificity of 86.2%, a positive predictive value (PPV) of 72.6%, a negative predictive value (NPV) of 93.8% and a Matthews correlation coefficient (MCC) of 0.69 in 10-fold cross-validation; it achieved a sensitivity of 58.8%, a specificity of 87.4%, a PPV of 65.1%, an NPV of 84.2% and an MCC of 0.48 in independent testing. For comparison, we built another prediction model that used RNA sequence data alone and ran it on the same dataset. In 10-fold cross-validation it achieved a sensitivity of 85.7%, a specificity of 80.5%, a PPV of 67.7%, an NPV of 92.2% and an MCC of 0.63; in independent testing it achieved a sensitivity of 67.7%, a specificity of 78.8%, a PPV of 57.6%, an NPV of 85.2% and an MCC of 0.45.
In both cross-validation and independent testing, the new model that used both RNA and protein sequences performed better than the model that used RNA sequence data alone on most performance measures. To the best of our knowledge, this is the first sequence-based method for predicting protein-binding nucleotides in RNA that considers the binding partner of the RNA. The new model will provide valuable information for designing biochemical experiments to find putative protein-binding sites in RNA of unknown structure.
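The five performance measures quoted above all derive from one confusion matrix over nucleotides (binding versus non-binding). The following is a minimal Python sketch of the standard definitions, not the authors' evaluation code. The function names `binary_metrics` and `mcc_from_rates` and the example counts are hypothetical, introduced only for illustration. The second function uses the known identity MCC = sqrt(PPV·TPR·TNR·NPV) − sqrt(FDR·FNR·FPR·FOR), so the reported MCC values can be checked against the other four reported rates without the underlying counts.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute the five measures from confusion-matrix counts.

    tp/fn: binding nucleotides predicted as binding / non-binding.
    tn/fp: non-binding nucleotides predicted as non-binding / binding.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "mcc": mcc,
    }


def mcc_from_rates(sensitivity, specificity, ppv, npv):
    """Recover MCC from the four rates alone (no raw counts needed)."""
    agree = sqrt(ppv * sensitivity * specificity * npv)
    disagree = sqrt(
        (1 - ppv) * (1 - sensitivity) * (1 - specificity) * (1 - npv)
    )
    return agree - disagree


# Rates as reported in the abstract: (sensitivity, specificity, PPV, NPV).
reported = {
    "RNA+protein, 10-fold CV": (0.865, 0.862, 0.726, 0.938),
    "RNA+protein, independent": (0.588, 0.874, 0.651, 0.842),
    "RNA only, 10-fold CV": (0.857, 0.805, 0.677, 0.922),
    "RNA only, independent": (0.677, 0.788, 0.576, 0.852),
}

for name, rates in reported.items():
    print(f"{name}: MCC = {mcc_from_rates(*rates):.2f}")

# Hypothetical counts, only to show the count-based function in use.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Applied to the reported rates, the identity gives MCC values that round to 0.69, 0.48, 0.63 and 0.45, matching the figures stated in the abstract. The gap between sensitivity and PPV in each setting is what one would expect if non-binding nucleotides outnumber binding ones in the dataset, which is also why MCC is a useful summary here.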