Wang Hongfei, Wang Zhuo, Li Zhongyan, Lee Tzong-Yi
Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China.
School of Life Sciences, University of Science and Technology of China, Hefei, China.
Front Cell Dev Biol. 2020 Sep 30;8:572195. doi: 10.3389/fcell.2020.572195. eCollection 2020.
Protein ubiquitylation is an important posttranslational modification (PTM) that is involved in diverse biological processes and plays an essential role in the regulation of physiological mechanisms and diseases. The Protein Lysine Modifications Database (PLMD) has accumulated abundant ubiquitylated proteins, with their substrate sites, for more than 20 species. Numerous studies have consequently developed a variety of ubiquitylation site prediction tools across all species, relying mainly on predefined sequence features and machine learning algorithms. However, the differences in ubiquitylation patterns between these species remain unclear. In this work, sequence-based characterization of ubiquitylated substrate sites revealed remarkable differences among plants, animals, and fungi. An improved word-embedding scheme based on a transfer learning strategy was then combined with a multilayer convolutional neural network (CNN) to identify protein ubiquitylation sites. For the prediction of plant ubiquitylation sites, the proposed deep learning scheme outperformed the machine learning-based methods, with an accuracy of 75.6%, a precision of 73.3%, a recall of 76.7%, an F-score of 0.7493, and an AUC of 0.82 on the independent testing set. Although the specificity of ubiquitylated substrate sites is complicated, this work demonstrates that the word-embedding method can extract informative features and help identify ubiquitylated sites. To accelerate the investigation of protein ubiquitylation, the data sets and source code used in this study are freely available at https://github.com/wang-hong-fei/DL-plant-ubsites-prediction.
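To make the word-embedding idea in the abstract more concrete, the following is a minimal sketch of how a protein sequence might be turned into "words" before embedding. It is not the authors' implementation: the function names `lysine_windows` and `kmer_words`, the flank length, the padding character `X`, and the word length `k` are all illustrative assumptions, since the abstract does not state them. The sketch extracts a fixed-length window centered on each lysine (K), the residue that carries ubiquitylation, and splits each window into overlapping k-mers that an embedding layer could then map to vectors.

```python
from typing import List, Tuple


def lysine_windows(sequence: str, flank: int = 15, pad: str = "X") -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in `sequence`.

    Each window has length 2 * flank + 1 with the lysine at its center.
    Positions beyond the sequence ends are filled with `pad`.
    The flank length and pad character are assumptions for illustration.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Index i in `sequence` sits at index i + flank in `padded`,
            # so the centered window starts at padded index i.
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


def kmer_words(window: str, k: int = 3) -> List[str]:
    """Split a window into overlapping k-mers (the "words" to be embedded)."""
    return [window[i : i + k] for i in range(len(window) - k + 1)]


if __name__ == "__main__":
    # Toy sequence, not taken from the study's data sets.
    for position, window in lysine_windows("MKTAYK", flank=2):
        print(position, window, kmer_words(window, k=3))
```

Each candidate lysine therefore becomes a short sequence of tokens, which is the form a word-embedding layer followed by a CNN would typically consume. The window length and k would need to be set to whatever values the released source code actually uses.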