Nguyen Van-Nui, Huang Kai-Yao, Huang Chien-Hsun, Lai K Robert, Lee Tzong-Yi
IEEE/ACM Trans Comput Biol Bioinform. 2017 Mar-Apr;14(2):393-403. doi: 10.1109/TCBB.2016.2520939. Epub 2016 Feb 8.
Protein ubiquitination, the conjugation of ubiquitin to lysine residues, is an important modulator of many cellular functions in eukaryotes. Recent advances in proteomic technology have stimulated growing interest in identifying ubiquitination sites. However, most computational tools for predicting ubiquitination sites focus on small-scale data. With the number of experimentally verified ubiquitination sites increasing, we were motivated to design a predictive model for identifying lysine ubiquitination sites in large-scale proteome datasets. This work assessed not only single features, such as amino acid composition (AAC), amino acid pair composition (AAPC), and evolutionary information, but also the effectiveness of combining two or more features in a hybrid approach to model construction. Support vector machines (SVMs) were used to build the prediction models for ubiquitination site identification. Evaluation by five-fold cross-validation showed that SVM models trained on combinations of hybrid features delivered better prediction performance. Additionally, a motif discovery tool, MDDLogo, was used to characterize the potential substrate motifs of ubiquitination sites. The SVM models integrating the MDDLogo-identified substrate motifs yielded an average accuracy of 68.70 percent. Furthermore, independent testing showed that the MDDLogo-clustered SVM models achieved a promising accuracy (78.50 percent) and performed better than other prediction tools. Two cases demonstrate the effective prediction of ubiquitination sites with their corresponding substrate motifs.
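The AAC and AAPC features named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window half-width `flank`, the `-` padding symbol, the normalization, and the helper names `lysine_windows`, `aac`, `aapc`, and `hybrid_features` are all assumptions made here for illustration, and evolutionary information and MDDLogo clustering are left out.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order, for the AAPC vector.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def lysine_windows(sequence, flank=6):
    """Yield (1-based position, window) for each lysine in the sequence.

    The window half-width `flank` is an assumed value; sequence ends are
    padded with '-' so every window has length 2 * flank + 1.
    """
    padded = "-" * flank + sequence + "-" * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            yield i + 1, padded[i : i + 2 * flank + 1]


def aac(window):
    """Amino acid composition: 20 residue frequencies, padding ignored."""
    residues = [r for r in window if r in AMINO_ACIDS]
    n = len(residues) or 1  # avoid division by zero on an empty window
    return [residues.count(a) / n for a in AMINO_ACIDS]


def aapc(window):
    """Amino acid pair composition: 400 adjacent-pair frequencies."""
    pairs = [window[i : i + 2] for i in range(len(window) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    n = len(valid) or 1
    return [valid.count(p) / n for p in PAIRS]


def hybrid_features(window):
    """Concatenate AAC and AAPC into one 420-dimensional hybrid vector."""
    return aac(window) + aapc(window)
```

Vectors built this way could then be passed to an SVM implementation and scored with five-fold cross-validation, for example scikit-learn's `SVC` with `cross_val_score(..., cv=5)`; that library choice is also an assumption, not something the abstract specifies.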