Zeng Ying, Chen Yuan, Yuan Zheming
School of Computer and Communication, Hunan Institute of Engineering, Xiangtan, 411104, Hunan, China.
Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China.
BioData Min. 2022 Feb 10;15(1):3. doi: 10.1186/s13040-022-00290-1.
Lysine succinylation is a protein post-translational modification that is widely involved in cell differentiation, cell metabolism, and other important physiological activities. Studying the molecular mechanism of succinylation in depth requires accurate identification of succinylation sites, and because experimental approaches are costly and time-consuming, there is great demand for reliable computational methods. Feature extraction is a key step in building succinylation site prediction models, and effective new features can improve predictive accuracy. Because false succinylation sites far outnumber true sites, traditional classifiers perform poorly, and designing a classifier that handles highly imbalanced datasets effectively remains a challenge.
A new computational method, iSuc-ChiDT, is proposed to identify succinylation sites in proteins. In iSuc-ChiDT, chi-square statistical difference table encoding is developed to extract positional features; it achieves higher predictive accuracy with fewer features than common position-based encoding schemes such as binary encoding and physicochemical property encoding. Single amino acid and undirected pair-coupled amino acid composition features are added to improve tolerance to residue insertions and deletions. After feature selection with the Chi-MIC-share algorithm, the chi-square decision table (ChiDT) classifier is constructed for imbalanced classification. On a training set of 4748:50,551 (true:false sites), ChiDT clearly outperforms traditional classifiers in predictive accuracy and runs quickly. On an independent testing set of experimentally identified succinylation sites, iSuc-ChiDT achieves a sensitivity of 70.47%, a specificity of 66.27%, a Matthews correlation coefficient of 0.205, and a global accuracy index Q of 0.683, a significant improvement in sensitivity and overall accuracy over PSuccE, Success, SuccinSite, and other existing succinylation site predictors.
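To make the composition features concrete, the following is a minimal sketch of one way to compute single amino acid and undirected pair-coupled composition for a sequence window. It is not the authors' implementation: the function name `composition_features`, the use of adjacent residues as pairs, and the normalization by counts are all assumptions made for illustration. "Undirected" is taken to mean that the pairs AB and BA share a single feature.

```python
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical
# 210 unordered pairs: "AK" stands for both A-K and K-A (assumed meaning of "undirected")
PAIRS = ["".join(p) for p in combinations_with_replacement(AMINO_ACIDS, 2)]


def composition_features(window: str) -> dict[str, float]:
    """Return 20 single-residue and 210 unordered adjacent-pair frequencies.

    Hypothetical helper: adjacency and frequency normalization are
    illustrative assumptions, not details taken from the paper.
    """
    residues = [r for r in window if r in AMINO_ACIDS]
    single = Counter(residues)

    # Sort each adjacent pair so that "KA" and "AK" map to the same key.
    pair_keys = [
        "".join(sorted(a + b))
        for a, b in zip(window, window[1:])
        if a in AMINO_ACIDS and b in AMINO_ACIDS
    ]
    pairs = Counter(pair_keys)

    n_single = len(residues) or 1   # avoid division by zero on empty input
    n_pairs = len(pair_keys) or 1

    features = {aa: single[aa] / n_single for aa in AMINO_ACIDS}
    features.update({p: pairs[p] / n_pairs for p in PAIRS})
    return features


feats = composition_features("AKKA")
print(len(feats), feats["A"], feats["KK"])
```

Because these features count residues and pairs regardless of where they occur in the window, a single inserted or deleted residue changes them only slightly, whereas it shifts every downstream position in a purely position-based encoding.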
iSuc-ChiDT shows great promise in predicting succinylation sites and is expected to facilitate further experimental investigation of protein succinylation.
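Sensitivity, specificity, and the Matthews correlation coefficient (MCC) have standard confusion-matrix definitions, sketched below. The counts in the example are hypothetical and chosen only to resemble an imbalanced test set; they are not the paper's data. The global accuracy index Q is not defined in this abstract, so it is deliberately left out.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Standard sensitivity, specificity, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true sites recovered
    specificity = tn / (tn + fp)   # fraction of false sites rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical counts: 100 true sites and 1000 false sites.
print(binary_metrics(tp=70, fn=30, tn=660, fp=340))
```

With far more false sites than true sites, even a moderate false-positive rate produces many more false positives than true positives, so MCC can remain low while sensitivity and specificity are both well above 0.5.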