Evaluating Protein Transfer Learning with TAPE.

Authors

Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song

Affiliations

UC Berkeley.

covariant.ai.

Publication information

Adv Neural Inf Process Syst. 2019 Dec;32:9689-9701.

PMID: 33390682
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7774645/

Abstract

Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
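
The abstract says the tasks are curated into specific training, validation, and test splits so that each task tests biologically relevant generalization, but it does not describe the procedure. One common way to get that property, shown here purely as an illustrative sketch and not as the authors' method, is to assign whole groups of related proteins to a single split, so that test proteins come from groups never seen during training. The function name `split_by_group`, the `family` field, and the toy sequences are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_group(records, group_key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups to train/valid/test so no group spans two splits.

    Illustrative assumption only: the TAPE abstract does not specify how
    its splits are built.
    """
    # Collect records under their group label (e.g. a protein family).
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle group names deterministically, then cut the list of groups.
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    n_train = round(fractions[0] * len(names))
    n_valid = round(fractions[1] * len(names))
    parts = {
        "train": names[:n_train],
        "valid": names[n_train:n_train + n_valid],
        "test": names[n_train + n_valid:],
    }

    # Expand each split's group names back into the underlying records.
    return {
        split: [rec for name in group_names for rec in groups[name]]
        for split, group_names in parts.items()
    }


# Toy data: 50 made-up sequences spread evenly over 10 hypothetical families.
records = [
    {"sequence": "MKV" + "A" * i, "family": f"fam{i % 10}"}
    for i in range(50)
]
splits = split_by_group(records, "family")
```

Splitting by group rather than by individual sequence keeps close relatives of a test protein out of the training set, which is the kind of leakage a random split would allow.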


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9a67/7774645/0fbd14f0230d/nihms-1646867-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9a67/7774645/d5b666ca942d/nihms-1646867-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9a67/7774645/f28d18723767/nihms-1646867-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9a67/7774645/907998ba3529/nihms-1646867-f0004.jpg
