Department of Plant Science, Faculty of Science, Tarbiat Modares University, Tehran, Iran.
Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbac015.
Protein phosphorylation is one of the most significant post-translational modifications (PTMs) and plays a crucial role in plant function through its impact on signaling, gene expression, enzyme kinetics, protein stability and protein interactions. Accurate prediction of plant phosphorylation sites (p-sites) is vital because abnormal regulation of phosphorylation usually leads to plant diseases. However, current experimental methods for PTM prediction suffer from high computational cost and are error-prone. The present study develops machine learning-based prediction techniques, including a high-performance, interpretable deep tabular learning network (TabNet), to improve the prediction of protein p-sites in soybean. Moreover, for the first time, we use a hybrid feature set of sequence-based features, physicochemical properties and position-specific scoring matrices to predict serine (Ser/S), threonine (Thr/T) and tyrosine (Tyr/Y) p-sites in soybean. Experimentally verified p-site data for soybean proteins were collected from the Eukaryotic Phosphorylation Sites Database (EPSD) and the database of post-translational modifications (dbPTM). Redundancy in the positive and negative samples was then removed by dropping protein sequences with >40% similarity. The developed techniques achieve accuracies above 70%. The TabNet model is the best-performing classifier: using hybrid features and a window size of 13, it reaches 78.96% sensitivity and 77.24% specificity. These results indicate that the TabNet method offers advantages in both performance and interpretability. The proposed technique can analyze the data automatically, without measurement errors or human intervention. Furthermore, it can be used to predict putative protein p-sites in plants effectively. The collected dataset and source code are publicly deposited at https://github.com/Elham-khalili/Soybean-P-sites-Prediction.
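To make the window size of 13 and the reported sensitivity and specificity concrete, the following is a minimal Python sketch and not the authors' implementation (their code is in the repository linked above and may differ). The function names `candidate_windows` and `sensitivity_specificity` are hypothetical, as is the padding symbol `X` used for positions beyond the sequence termini. The sketch only extracts a fixed-length window centered on each Ser/Thr/Tyr residue and computes the two metrics from binary labels; it does not build the hybrid features or train TabNet.

```python
from typing import Iterator, Sequence, Tuple

TARGETS = {"S", "T", "Y"}  # Ser, Thr, Tyr: the residues considered in the abstract
WINDOW = 13                # window size reported for the best model
PAD = "X"                  # assumption: filler for positions past either terminus


def candidate_windows(seq: str, window: int = WINDOW) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based position, residue, window) for every S/T/Y in seq."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    # Pad both ends so residues near the termini still get a full-length window.
    padded = PAD * half + seq + PAD * half
    for i, residue in enumerate(seq):
        if residue in TARGETS:
            # In padded coordinates the residue sits at i + half,
            # so the slice [i, i + window) is centered on it.
            yield i + 1, residue, padded[i:i + window]


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, 1 = p-site."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy sequence, not taken from the soybean dataset.
    for pos, residue, win in candidate_windows("MSK"):
        print(pos, residue, win)
    print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

Each window produced this way would then be encoded (for example with sequence-based, physicochemical and PSSM features, as the abstract describes) before being passed to a classifier. Sensitivity is the fraction of true p-sites recovered, and specificity is the fraction of non-p-sites correctly rejected.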