Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models.

Authors

Chu Hongkang, Liu Taigang

Affiliation

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

Publication

Int J Mol Sci. 2024 Apr 19;25(8):4507. doi: 10.3390/ijms25084507.

Abstract

Identifying druggable proteins can greatly reduce the cost of discovering potential new drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods, which support drug discovery by building predictive models. In this study, we propose a fast and precise classifier for identifying druggable proteins using a protein language model (PLM) with fine-tuned Evolutionary Scale Modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. We also carefully compared the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features using the same classifiers. The results suggest that ESM-2 embeddings outperform PSSM features in both accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on a modified Generative Pre-trained Transformer 2 (GPT-2). To our knowledge, this is the first time the large language model (LLM) GPT-2 has been applied to the recognition of druggable proteins. Additionally, a more up-to-date dataset, Pharos, was adopted to further validate the performance of the proposed model.
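The comparison described in the abstract, two feature sets evaluated with the same classifier, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the classifiers or the pooling strategy, so a nearest-centroid rule stands in for the classifier, mean pooling stands in for the step that turns per-residue vectors into one fixed-length protein vector, and the numbers are toy values rather than real ESM-2 or PSSM features.

```python
from statistics import mean


def mean_pool(vectors):
    """Average a list of equal-length vectors column by column.

    Assumption: used here to collapse per-residue vectors (L x D)
    into a single D-dimensional protein vector.
    """
    return [mean(column) for column in zip(*vectors)]


def fit_centroids(X, y):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        centroids[label] = mean_pool(rows)
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


def accuracy(centroids, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == target for x, target in zip(X, y))
    return hits / len(y)


# Toy per-residue vectors for one hypothetical 3-residue protein (D = 2).
toy_protein = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
toy_vector = mean_pool(toy_protein)

# Toy pooled feature vectors; 1 = druggable, 0 = non-druggable (made up).
X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y_train = [0, 0, 1, 1]
model = fit_centroids(X_train, y_train)
train_accuracy = accuracy(model, X_train, y_train)
```

To compare two feature types in this setup, one would call `fit_centroids` and `accuracy` once per feature matrix with the same labels and the same held-out split, so that any difference in accuracy comes from the features rather than from the classifier.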

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3442/11049818/c6dd27ad46ca/ijms-25-04507-g001.jpg
