Department of Computer Science, University of Management and Technology, Lahore, Pakistan.
National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan; Center for Professional & Applied Studies, Lahore, Pakistan.
Anal Biochem. 2021 Feb 15;615:114069. doi: 10.1016/j.ab.2020.114069. Epub 2020 Dec 16.
Deep representations can replace human-engineered representations, which are constrained by certain limitations. To predict protein post-translational modification (PTM) sites, the research community applies various feature extraction techniques to pseudo amino acid composition (PseAAC). Serine phosphorylation is one of the most important PTMs, as it is the most frequently occurring and is important for various biological functions. Creating efficient representations from large protein sequences to predict PTM sites is a time- and resource-intensive task. In this study, we propose, implement, and evaluate the use of deep learning to learn effective protein data representations from PseAAC for developing data-driven PTM detection systems, and we compare these with two human-engineered representations. The comparisons are performed by training an XGBoost-based classifier on each representation. The best scores were achieved by the RNN-LSTM-based deep representation and the CNN-based representation, with accuracies of 81.1% and 78.3%, respectively. The human-engineered representations scored 77.3% and 74.9%, respectively. Based on these results, we conclude that deep features are a promising replacement for feature engineering in identifying PhosS sites efficiently and accurately, which can help scientists understand the mechanism of this modification in proteins.
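As an illustration of the kind of input a learned representation could start from, the following minimal sketch extracts a fixed-length window around each serine residue in a protein sequence and maps it to integer indices. The abstract does not describe the preprocessing, so everything here is an assumption made for illustration: the window approach itself, the flank size, the padding symbol, the index scheme, and the helper names `serine_windows` and `encode` are hypothetical and are not the authors' method.

```python
# Hypothetical preprocessing sketch; not taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"                             # assumed padding symbol for sequence ends
# Index 0 is reserved for padding and any non-standard residue.
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def serine_windows(sequence, flank=10):
    """Yield (position, window) for every serine (S) in the sequence.

    position is 0-based; each window has length 2 * flank + 1 with the
    serine at its center, padded with PAD where it runs past either end.
    """
    seq = sequence.upper()
    padded = PAD * flank + seq + PAD * flank
    for pos, residue in enumerate(seq):
        if residue == "S":
            # pos in seq corresponds to pos + flank in padded, so this
            # slice is centered on the serine.
            yield pos, padded[pos:pos + 2 * flank + 1]


def encode(window):
    """Map a residue window to a list of integer indices."""
    return [AA_INDEX.get(residue, 0) for residue in window]


if __name__ == "__main__":
    for position, window in serine_windows("MKSAS", flank=2):
        print(position, window, encode(window))
```

Each encoded window could then be fed to a sequence model such as an LSTM or CNN, and the resulting learned features passed to a downstream classifier, in the spirit of the comparison described above.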