Wu Jian-Sheng, Zhou Zhi-Hua
Nanjing University and Nanjing University of Posts and Telecommunications, Nanjing.
IEEE/ACM Trans Comput Biol Bioinform. 2013 May-Jun;10(3):752-9. doi: 10.1109/TCBB.2013.75.
Recognizing microRNA (miRNA)-binding residues in proteins helps to explain how miRNAs silence their target genes. Existing computational methods are difficult to apply to this prediction task because few labeled training examples are available. To address this issue, unlabeled data can be exploited to help construct a computational model. Semisupervised learning covers methods that automatically exploit unlabeled data in addition to labeled data to improve learning performance, with no human intervention assumed. In addition, miRNA-binding proteins almost always contain far fewer binding residues than nonbinding residues, and cost-sensitive learning is regarded as a good solution to this class imbalance problem. In this work, a novel model is proposed for recognizing miRNA-binding residues in proteins from sequences, using a cost-sensitive extension of Laplacian support vector machines (CS-LapSVM) with a hybrid feature. The hybrid feature consists of evolutionary information from the amino acid sequence (position-specific scoring matrices), conservation information about three biochemical properties (HKM), and mutual interaction propensities in protein-miRNA complex structures. CS-LapSVM achieves an F1 score of 26.23 ± 2.55% and an AUC of 0.805 ± 0.020, outperforming existing approaches for the recognition of RNA-binding residues. A web server called SARS has been built and is freely available for academic use.
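The combination of manifold regularization over unlabeled data with class-dependent misclassification costs can be illustrated with a small sketch. This is not the authors' CS-LapSVM: for brevity it swaps the SVM hinge loss for a squared loss (a cost-weighted Laplacian regularized least squares), which has a closed-form solution. The function names, the RBF kernel, the k-nearest-neighbor graph, the cost values, the regularization weights, and the toy 2-D data are all assumptions made for illustration; the paper's hybrid features are not reproduced here.

```python
import numpy as np

def sq_dists(A, B):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)

def rbf_kernel(A, B, gamma):
    return np.exp(-gamma * sq_dists(A, B))

def knn_laplacian(X, k):
    # Unnormalized graph Laplacian L = D - W of a symmetrized kNN graph.
    n = len(X)
    d2 = sq_dists(X, X)
    np.fill_diagonal(d2, np.inf)            # exclude self-neighbors
    nn = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)
    return np.diag(W.sum(1)) - W

def fit_cs_laprls(X_lab, y_lab, X_unl, cost_pos=5.0, cost_neg=1.0,
                  gamma_a=1e-2, gamma_i=1e-2, k=5, gamma=0.5):
    # Minimizes sum_i c_i (y_i - f(x_i))^2 + gamma_a ||f||_K^2 + gamma_i f^T L f
    # with f = K alpha; c_i is the class cost for labeled points, 0 for unlabeled.
    X = np.vstack([X_lab, X_unl])
    n, n_lab = len(X), len(X_lab)
    K = rbf_kernel(X, X, gamma)
    L = knn_laplacian(X, k)
    c = np.zeros(n)
    c[:n_lab] = np.where(y_lab > 0, cost_pos, cost_neg)
    y = np.zeros(n)
    y[:n_lab] = y_lab
    # Stationarity condition: (C K + gamma_a I + gamma_i L K) alpha = C y
    A = c[:, None] * K + gamma_a * np.eye(n) + gamma_i * L @ K
    alpha = np.linalg.solve(A, c * y)
    return X, alpha, gamma

def predict(model, X_new):
    X, alpha, gamma = model
    return np.sign(rbf_kernel(X_new, X, gamma) @ alpha)

# Toy imbalanced data: few labeled "binding" (+1) points, many "nonbinding" (-1).
rng = np.random.default_rng(0)
pos = rng.normal(loc=3.0, scale=0.5, size=(25, 2))
neg = rng.normal(loc=-3.0, scale=0.5, size=(80, 2))
X_lab = np.vstack([pos[:5], neg[:30]])
y_lab = np.concatenate([np.ones(5), -np.ones(30)])
X_unl = np.vstack([pos[5:], neg[30:]])

model = fit_cs_laprls(X_lab, y_lab, X_unl)
pred = predict(model, np.array([[3.0, 3.0], [-3.0, -3.0]]))
```

Raising `cost_pos` relative to `cost_neg` penalizes errors on the rare positive class more heavily, which is the cost-sensitive idea described in the abstract; the Laplacian term encourages predictions that vary smoothly across neighboring labeled and unlabeled points.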