

ProteinGLUE multi-task benchmark suite for self-supervised protein modeling.

Affiliation

Informatics Institute, Vrije Universiteit, 1081 HV, Amsterdam, The Netherlands.

Publication

Sci Rep. 2022 Sep 26;12(1):16047. doi: 10.1038/s41598-022-19608-4.

DOI: 10.1038/s41598-022-19608-4
PMID: 36163232
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9512797/
Abstract

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether the models capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks, masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets from https://github.com/ibivu/protein-glue.
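The abstract describes pre-training on two BERT-style tasks: masked symbol prediction (hiding residues and predicting them from context) and next sentence prediction adapted to proteins. As an illustration of the input-corruption step of the first task, here is a minimal, self-contained Python sketch; the `mask_sequence` helper, the `<mask>` token, and the 15% masking rate are illustrative assumptions, not the paper's actual implementation (see the linked repository for that).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"

def mask_sequence(seq, mask_prob=0.15, seed=0):
    """BERT-style masked-symbol corruption: each residue is independently
    selected with probability mask_prob; selected positions are replaced by
    a mask token and recorded as (position, original residue) targets."""
    rng = random.Random(seed)
    tokens, targets = [], []
    for i, aa in enumerate(seq):
        if rng.random() < mask_prob:
            tokens.append(MASK)
            targets.append((i, aa))
        else:
            tokens.append(aa)
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens, targets = mask_sequence(seq)
```

With the default 15% rate, roughly one residue in seven becomes a prediction target; the model is then trained to recover the original amino acid at each masked position from the surrounding sequence context.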


Availability

Code and datasets are available from https://github.com/ibivu/protein-glue.

Figures (PMC full-text images):
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2765/9512797/3c825cbc3667/41598_2022_19608_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2765/9512797/fc1feb08c0fa/41598_2022_19608_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2765/9512797/3ef87626dc49/41598_2022_19608_Fig3_HTML.jpg


