Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models

Authors

Chu Hongkang, Liu Taigang

Affiliation

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

Publication

Int J Mol Sci. 2024 Apr 19;25(8):4507. doi: 10.3390/ijms25084507.

DOI:10.3390/ijms25084507

PMID:38674091

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11049818/

Abstract

Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model.
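
As a rough illustration of the embedding-plus-classifier idea described in the abstract, the sketch below shows one common way to turn per-residue embeddings of variable-length proteins into fixed-length feature vectors (mean pooling) and how an accuracy figure like the one reported is computed. The abstract does not say how the ESM-2 embeddings were pooled or which classifiers were used, so the pooling choice, the function names `mean_pool` and `accuracy`, and all toy numbers here are assumptions for illustration only, not the authors' pipeline or real ESM-2 outputs.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length protein vector.

    Assumption: mean pooling is used here only as a common, simple choice;
    the source abstract does not specify the pooling strategy.
    """
    if not residue_embeddings:
        raise ValueError("protein has no residues")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of proteins whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy per-residue embeddings (made-up values, 2 dimensions for readability).
protein_a = [[1.0, 2.0], [3.0, 4.0]]              # 2 residues
protein_b = [[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]]  # 3 residues

# Proteins of different lengths map to vectors of the same length,
# which is what a downstream classifier needs as input.
features = [mean_pool(protein_a), mean_pool(protein_b)]

# Hypothetical labels (1 = druggable, 0 = non-druggable) and predictions.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

In practice the per-residue vectors would come from a protein language model rather than hand-written lists, and the pooled vectors would be passed to whichever classifier is being compared against PSSM-based features.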

