College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
Department of Microbiology, Abdul Wali Khan University, Mardan, KPK, Pakistan.
Sci Rep. 2024 Jul 23;14(1):16992. doi: 10.1038/s41598-024-67433-8.
Anticancer peptides (ACPs) play a promising role in anticancer drug discovery. Research on ACPs as therapeutic agents is growing because of their minimal side effects. However, identifying novel ACPs through wet-lab experiments is generally time-consuming, labor-intensive, and expensive. Computational methods for fast and accurate prediction of ACPs would accelerate the drug discovery process. Herein, a machine learning-based predictor, called PLMACPred, was developed for identifying ACPs from peptide sequence alone. PLMACPred encoded peptides with a set of encoding schemes representing evolutionary properties, composition properties, and protein language model (PLM) embeddings, i.e., embeddings based on evolutionary scale modeling (ESM-2) and ProtT5. Then, two-dimensional (2D) wavelet denoising (WD) was employed to remove noise from the extracted features. Finally, an ensemble-based cascade deep forest (CDF) model was developed to identify ACPs. PLMACPred attained superior performance on all three benchmark datasets, namely ACPmain, ACPAlter, and ACP740, under tenfold cross-validation and independent-dataset testing. PLMACPred outperformed existing models, improving prediction accuracy by 18.53%, 2.4%, and 7.59% on the ACPmain, ACPAlter, and ACP740 datasets, respectively. We showed that embeddings from ProtT5 and ESM-2 captured contextual information from the entire sequence better than the other encoding schemes for ACP prediction. To explain the proposed model, the SHAP (SHapley Additive exPlanations) method was used to analyze the effect of each feature on ACP prediction. A list of novel sequence motifs was derived from the ACP sequences using the MEME Suite. We believe PLMACPred will help accelerate the discovery of novel ACPs and the study of other activities of microbial peptides.
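The encode-then-denoise steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the PLMACPred implementation. The abstract does not name the specific composition encodings, the wavelet family, the decomposition level, or the threshold, so everything here is an assumption chosen for illustration: amino acid composition stands in for a composition-property encoding, the wavelet is a single-level 2D Haar transform with soft thresholding of the detail coefficients, and the function names, peptide strings, and threshold value are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

# The 20 standard amino acids, in a fixed order that defines the feature columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> List[float]:
    """Return the fraction of each standard residue in the peptide (20 values).

    Assumption: used here only as one example of a composition-type encoding.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must be non-empty")
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink a coefficient toward zero by `threshold` (soft thresholding)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def haar2d_denoise(matrix: Matrix, threshold: float) -> Matrix:
    """Single-level 2D Haar wavelet denoising of a feature matrix.

    Each 2x2 block is transformed into one approximation coefficient and three
    detail coefficients; the details are soft-thresholded and the block is
    reconstructed. Assumption: Haar, one level, one global threshold.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows % 2 or cols % 2:
        raise ValueError("matrix dimensions must be even for this sketch")
    if threshold < 0:
        raise ValueError("threshold must be non-negative")

    out = [[0.0] * cols for _ in range(rows)]
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = matrix[i][j], matrix[i][j + 1]
            c, d = matrix[i + 1][j], matrix[i + 1][j + 1]

            # Analysis: orthonormal 2D Haar transform of the 2x2 block.
            approx = (a + b + c + d) / 2
            d_cols = soft_threshold((a - b + c - d) / 2, threshold)
            d_rows = soft_threshold((a + b - c - d) / 2, threshold)
            d_diag = soft_threshold((a - b - c + d) / 2, threshold)

            # Synthesis: inverse transform with the shrunken detail coefficients.
            out[i][j] = (approx + d_cols + d_rows + d_diag) / 2
            out[i][j + 1] = (approx - d_cols + d_rows - d_diag) / 2
            out[i + 1][j] = (approx + d_cols - d_rows - d_diag) / 2
            out[i + 1][j + 1] = (approx - d_cols - d_rows + d_diag) / 2
    return out


# Hypothetical peptide strings, used only to build a small feature matrix
# (rows = peptides, columns = the 20 composition features).
features = [amino_acid_composition(p) for p in ("KKLLAAGG", "AAKKLLWW")]
denoised = haar2d_denoise(features, threshold=0.05)
```

With a threshold of zero the transform reconstructs the input exactly, and with a very large threshold every 2x2 block collapses to its mean, which gives a quick sanity check on the amount of smoothing. In a fuller pipeline the denoised matrix, alongside PLM embeddings, would be passed to the classifier; that stage is omitted here because the abstract gives no configuration details for the cascade deep forest.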