Zuo Yun, Fang Xingze, Chen Jiankang, Ji Jiayi, Li Yuwen, Wu Zeyu, Liu Xiangrong, Zeng Xiangxiang, Deng Zhaohong, Yin Hongwei, Zhao Anjing
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf189.
In post-translational modification (PTM), chemical groups covalently attached to lysine residues significantly change a protein's physical and chemical properties. These modifications shape protein structure, enhance function and stability, and are vital for physiological processes, affecting health and disease through mechanisms such as gene expression, signal transduction, protein degradation, and cell metabolism. Although lysine (K) modifications are considered among the most common types of PTM in proteins, research on K-PTMs has largely overlooked the synergistic effects between different modifications and has lacked techniques for addressing sample imbalance. To fill this gap, this study proposes the Extreme Point Deviation Compensated Clustering (EPDCC) Undersampling algorithm and combines it with Cross-Scale Convolutional Neural Networks (CSCNNs) to develop a novel computational tool, MlyPredCSED, for simultaneously predicting multiple types of lysine modification site. MlyPredCSED uses Multi-Label Position-Specific Triad Amino Acid Propensity together with the physicochemical properties of amino acids to enrich the sequence information available to the model. To address sample imbalance, the EPDCC Undersampling technique reduces the majority-class samples. The model is trained and tested with the CSCNN framework. In cross-validation and testing, MlyPredCSED outperformed existing models, especially in complex categories with multiple modification sites. This research not only provides an efficient method for identifying lysine modification sites but also demonstrates its value in biological research and drug development. To make MlyPredCSED easy for researchers to use, we have developed a free, accessible web tool: http://www.mlypredcsed.com.
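The abstract does not give the details of EPDCC, so the following is only a minimal sketch of the general idea behind clustering-based undersampling, not the authors' algorithm. It assumes a plain k-means step (swapped in here for illustration) and keeps the real majority-class sample closest to each centroid; the extreme-point deviation compensation is not reproduced. The function names `kmeans` and `cluster_undersample` and the toy feature vectors are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct samples
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                # Mean of the members, computed coordinate by coordinate.
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Reduce the majority class to n_keep samples (n_keep <= len(majority)).

    One real sample is kept per cluster: the one closest to the centroid.
    This is generic cluster-based undersampling, NOT the EPDCC procedure.
    """
    centroids = kmeans(majority, n_keep, seed=seed)
    remaining = list(majority)
    kept = []
    for c in centroids:
        best = min(remaining, key=lambda p: dist(p, c))
        kept.append(best)
        remaining.remove(best)  # never pick the same sample twice
    return kept


# Toy 2-D feature vectors (hypothetical): six majority vs. two minority samples.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
minority = [(2.0, 2.0), (2.1, 2.0)]
balanced_majority = cluster_undersample(majority, n_keep=len(minority))
print(len(balanced_majority), len(minority))
```

Keeping real samples rather than the centroids themselves means every retained example is still an observed sequence window. A method such as EPDCC would additionally have to decide how samples far from the cluster centre are treated, which this sketch ignores.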