Zuo Yun, Wan Minquan, Shen Yang, Wang Xinheng, He Wenying, Bi Yue, Liu Xiangrong, Deng Zhaohong
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
Comput Biol Chem. 2024 Dec;113:108212. doi: 10.1016/j.compbiolchem.2024.108212. Epub 2024 Sep 13.
Protein lysine crotonylation is an important post-translational modification that regulates various cellular activities. For example, histone crotonylation affects chromatin structure and promotes histone replacement. Identifying and understanding lysine crotonylation sites is therefore crucial in protein research. However, as the number of known non-histone crotonylation sites grows, existing classifiers based on traditional machine learning may encounter performance limitations. To address this problem, and given the advantages of deep learning for sequence data analysis, this study presents an MLP-Attention-based model for identifying crotonylation sites. First, three feature extraction strategies, namely Amino Acid Composition, K-mer, and Distance-based residue features, were used to encode crotonylated and non-crotonylated sequences. Then, to balance the training dataset, the FCM-GRNN undersampling algorithm, which combines fuzzy clustering with a generalized regression neural network, was introduced. Finally, various classification algorithms were compared experimentally, and the multilayer perceptron (MLP) combined with a superimposed self-attention mechanism was selected to construct the prediction model, ILYCROsite. Results from independent testing and five-fold cross-validation showed that ILYCROsite performs strongly. Notably, on the independent test set, ILYCROsite achieves an AUC of 87.93%, which is significantly better than existing state-of-the-art models.
In addition, SHAP (SHapley Additive exPlanations) values were used to analyze feature importance and the impact of individual features on model predictions. To make the model easier for researchers to use, we also developed a prediction program that identifies crotonylation sites in a given protein sequence. The data and code for this program are available at: https://github.com/wmqskr/ILYCROsite.
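As an illustration of two of the encodings named in the abstract, the following is a minimal sketch of Amino Acid Composition and K-mer features for a peptide fragment. It is not taken from the ILYCROsite repository: the function names `aac`, `kmer`, and `encode`, the choice of k = 2, the frequency normalization, and the handling of non-standard residues are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino Acid Composition: fraction of each standard residue in seq."""
    counts = Counter(seq)
    # Assumption: non-standard symbols (e.g. padding 'X') are ignored.
    n = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def kmer(seq, k=2):
    """Normalized counts of overlapping k-mers over all 20**k possible k-mers."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    grams = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Keep only k-mers made entirely of standard residues.
    counts = Counter(g for g in grams if all(c in AMINO_ACIDS for c in g))
    total = sum(counts.values())
    return [counts[v] / total if total else 0.0 for v in vocab]


def encode(seq, k=2):
    """Concatenate the AAC and k-mer vectors (hypothetical combined encoding)."""
    return aac(seq) + kmer(seq, k)
```

With k = 2, `encode("AKKA")` returns a vector of length 420 (20 composition values followed by 400 dipeptide frequencies). In practice such vectors would be computed for a fixed-length window centered on each candidate lysine; the window length is not stated in the abstract and is left open here.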