Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models.

Authors

Chu Hongkang, Liu Taigang

Affiliation

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

Publication

Int J Mol Sci. 2024 Apr 19;25(8):4507. doi: 10.3390/ijms25084507.

DOI:10.3390/ijms25084507
PMID:38674091
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11049818/
Abstract

Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model.
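
For readers unfamiliar with the PSSM features mentioned in the abstract, the following is a minimal sketch of how a position-specific scoring matrix can be derived from a set of aligned sequences. It is a toy illustration only and is not the authors' pipeline: the function names `build_pssm` and `score_sequence`, the uniform background frequency, the pseudocount value, and the four-residue example alignment are all assumptions made for this example. PSSMs used in practice are usually produced by profile-search tools over large sequence databases.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumptions (illustrative only): sequences are equal-length and
    gap-free, and the background frequency is uniform (1/20).
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    n_seqs = len(aligned_seqs)
    pssm = []
    for col in range(length):
        # Count how often each residue occurs in this column.
        counts = Counter(seq[col] for seq in aligned_seqs)
        column_scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from producing log(0).
            freq = (counts[aa] + pseudocount) / (
                n_seqs + pseudocount * len(AMINO_ACIDS)
            )
            column_scores[aa] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-position log-odds scores of a sequence against the PSSM."""
    return sum(pssm[i][aa] for i, aa in enumerate(seq))


# Hypothetical toy alignment, not data from the paper.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKVF"]
pssm = build_pssm(toy_alignment)
```

A residue that is conserved in a column (here, M at the first position) receives a positive score, while residues never observed there receive negative scores. Note that the matrix has one row per sequence position, so its size varies with protein length, whereas embeddings from a language model are commonly pooled into a fixed-length vector before classification.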
