Cheng Cheng-Wei, Su Emily Chia-Yu, Hwang Jenn-Kang, Sung Ting-Yi, Hsu Wen-Lian
Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan.
BMC Bioinformatics. 2008 Dec 12;9 Suppl 12(Suppl 12):S6. doi: 10.1186/1471-2105-9-S12-S6.
RNA-protein interaction plays an essential role in several biological processes, such as protein synthesis, gene expression, posttranscriptional regulation and viral infectivity. Identification of RNA-binding sites in proteins provides valuable insights for biologists. However, experimental determination of RNA-protein interaction remains time-consuming and labor-intensive. Thus, computational approaches for prediction of RNA-binding sites in proteins have become highly desirable. Extensive studies of RNA-binding site prediction have led to the development of several methods. However, these methods can achieve high specificity only at the cost of low sensitivity.
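The trade-off between sensitivity and specificity is easiest to see with numbers. The sketch below computes sensitivity, specificity, overall accuracy, and the Matthews correlation coefficient from a confusion matrix using their standard textbook definitions. The function name `binding_site_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def binding_site_metrics(tp, fp, tn, fn):
    """Standard per-residue metrics from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding
    tn/fp: non-binding residues predicted as non-binding / binding
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Illustrative counts only: 100 binding residues among 1000 in total.
toy = binding_site_metrics(tp=20, fp=20, tn=880, fn=80)
print(toy)
```

With these made-up counts the predictor finds only 20 of the 100 binding residues, yet its accuracy is still 0.9 because non-binding residues dominate. This is why sensitivity and the Matthews correlation coefficient are worth reporting alongside accuracy on imbalanced binding-site data.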
We propose a method, RNAProB, which combines a new smoothed position-specific scoring matrix (PSSM) encoding scheme with a support vector machine model to predict RNA-binding sites in proteins. In addition to the evolutionary information carried by standard PSSM profiles, the proposed smoothed PSSM encoding scheme also accounts for the correlation with, and dependency on, the neighboring residues of each amino acid in a protein. Experimental results show that smoothed PSSM encoding significantly enhances prediction performance, especially sensitivity. Using five-fold cross-validation, our method outperforms the state-of-the-art systems by 4.90%-6.83%, 0.88%-5.33%, and 0.10-0.23 in terms of overall accuracy, specificity, and Matthews correlation coefficient, respectively. Most notably, compared to other approaches, RNAProB significantly improves sensitivity by 7.0%-26.9% over the benchmark data sets. To guard against overfitting, a three-way data split procedure is used to estimate the prediction performance. Moreover, physicochemical properties and amino acid preferences of RNA-binding proteins are examined and analyzed.
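The abstract does not spell out how the smoothing is computed, so the following is only a minimal sketch of one plausible reading: each residue's PSSM row is replaced by the sum of the rows inside a centred window, with positions beyond the sequence ends contributing nothing. The function name `smooth_pssm`, the default window size, and the zero-padding rule are assumptions made for illustration and are not taken from RNAProB itself.

```python
def smooth_pssm(pssm, window=3):
    """Sum each PSSM row with its neighbours inside a centred window.

    pssm   : list of rows, one per residue; each row is a list of scores
             (20 columns in a real PSSM, fewer in the toy example below)
    window : odd smoothing window size (hypothetical default)

    Assumption: positions past either end of the sequence are treated
    as all-zero rows, so terminal residues sum over fewer neighbours.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(pssm)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        # Column-wise sum over the rows that fall inside the window.
        smoothed.append([sum(col) for col in zip(*pssm[lo:hi])])
    return smoothed


# Toy 3-residue profile with 2 score columns, for illustration only.
toy_pssm = [[1, 0], [2, 1], [3, 5]]
print(smooth_pssm(toy_pssm, window=3))  # [[3, 1], [6, 6], [5, 6]]
```

In a full pipeline, the smoothed rows around each target residue would then be concatenated into a feature vector and passed to a support vector machine; that step is omitted here because its parameters are not given in the abstract.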
Our results demonstrate that the smoothed PSSM encoding scheme significantly enhances the performance of RNA-binding site prediction in proteins. This also supports our assumption that smoothed PSSM encoding can better resolve the ambiguity in discriminating between interacting and non-interacting residues by modelling the dependency on surrounding residues. The proposed method can also be applied to other research areas, such as DNA-binding site prediction, protein-protein interaction prediction, and prediction of posttranslational modification sites.