School of Computer Science and Engineering, Inha University, Inchon 402-751, South Korea.
BMC Bioinformatics. 2011;12 Suppl 13(Suppl 13):S7. doi: 10.1186/1471-2105-12-S13-S7. Epub 2011 Nov 30.
Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on sequence similarity. A sequence similarity-based method either keeps a whole sequence in the training dataset or discards it entirely. However, similar or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when sequences are removed. Furthermore, a training dataset constructed this way may still contain redundant data when a retained sequence contains similar subsequences within itself. Beyond the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when predicting RNA-binding amino acids. As a result, they always predict the same RNA-binding sites for a given protein sequence, even if the protein binds to different RNA molecules.
We developed a feature vector-based method that removes redundant data to produce a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and an RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicts RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 in 5-fold cross-validation on a non-redundant dataset derived from 3,149 protein-RNA interacting pairs. On an independent test dataset that does not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets showed that our model outperformed them.
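The three reported measures can be computed from a binary confusion matrix. The sketch below is illustrative only: the abstract does not define its "correlation coefficient" or "F-measure", so the code assumes the Matthews correlation coefficient and the F1 score of the binding class. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, F-measure (assumed F1), and correlation coefficient (assumed MCC).

    tp/fp/tn/fn are counts of residues, with "binding" as the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Under these assumptions, accuracy and the correlation coefficient can move in different directions when binding and non-binding residues are unevenly represented, which may be why the independent test shows a higher accuracy alongside a lower correlation coefficient.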
The feature vector-based redundancy reduction method is effective for constructing a non-redundant training dataset for a learning model, since it yields a larger dataset than the standard sequence similarity-based method while still containing no redundant data. Including the features of both RNA and protein sequences in a feature vector gives better performance than using protein features alone when predicting the RNA-binding residues in a protein sequence.
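The idea of removing redundancy at the level of feature vectors, rather than whole sequences, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sliding-window encoding, the single categorical RNA feature, the inclusion of the label in the deduplication key, and the names `window_examples` and `dedupe_by_feature_vector` are all hypothetical.

```python
def window_examples(protein, binding, rna_feature, half_width=1):
    """Yield one (feature_vector, label) pair per residue.

    Assumed encoding: the residue's sequence window plus one RNA feature.
    `binding` is a string of '0'/'1' flags, one per residue.
    """
    padded = "-" * half_width + protein + "-" * half_width
    for i, flag in enumerate(binding):
        window = padded[i:i + 2 * half_width + 1]
        yield (window, rna_feature), int(flag)


def dedupe_by_feature_vector(examples):
    """Keep the first occurrence of each (feature vector, label) pair."""
    seen = set()
    kept = []
    for features, label in examples:
        key = (tuple(features), label)
        if key not in seen:
            seen.add(key)
            kept.append((features, label))
    return kept


# Hypothetical protein-RNA pairs: the same protein bound to two RNA types,
# plus an exact repeat of the first pair.
pairs = [
    ("GKRKG", "01110", "mRNA"),
    ("GKRKG", "00100", "tRNA"),
    ("GKRKG", "01110", "mRNA"),
]
examples = [ex for p, b, r in pairs for ex in window_examples(p, b, r)]
kept = dedupe_by_feature_vector(examples)
```

In this toy setup the identical protein bound to a different RNA is retained, because its feature vectors differ in the RNA feature, while the exact repeat contributes nothing new. A sequence similarity-based filter would have discarded both copies of the identical protein.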