
Collectively encoding protein properties enriches protein language models.

Affiliations

School of Life Sciences, Northeast Agricultural University, Harbin, 150030, China.

State Key Laboratory of Membrane Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, 100101, China.

Publication Information

BMC Bioinformatics. 2022 Nov 8;23(1):467. doi: 10.1186/s12859-022-05031-z.

DOI: 10.1186/s12859-022-05031-z
PMID: 36348281
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9641823/
Abstract

Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning on specific in-domain tasks. However, few studies have focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
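The abstract describes the architecture only in prose. The following PyTorch snippet is a minimal sketch of one plausible reading of that setup: a shared pre-trained BERT encoder feeding three sequence-level classification heads (family, superfamily, fold) trained jointly with a summed cross-entropy loss. The class name, label counts, checkpoint, [CLS] pooling choice, and equal task weights are assumptions for illustration, not details taken from the paper or the authors' released code.

```python
# Minimal sketch (assumed details, not the authors' implementation):
# one shared BERT encoder, three sequence-level classification heads.
import torch.nn as nn
from transformers import BertModel

class MultiTaskProteinBERT(nn.Module):  # hypothetical class name
    def __init__(self, backbone="bert-base-uncased",          # assumed checkpoint
                 n_family=5000, n_superfamily=2000, n_fold=1200):  # assumed label counts
        super().__init__()
        self.encoder = BertModel.from_pretrained(backbone)  # shared pre-trained backbone
        hidden = self.encoder.config.hidden_size
        # One linear head per task; all tasks share the same encoder weights.
        self.heads = nn.ModuleDict({
            "family":      nn.Linear(hidden, n_family),
            "superfamily": nn.Linear(hidden, n_superfamily),
            "fold":        nn.Linear(hidden, n_fold),
        })

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] embedding as the sequence summary
        return {task: head(pooled) for task, head in self.heads.items()}

def mtl_loss(logits, labels):
    # Jointly optimize all three tasks; equal weighting is an assumption.
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[t], labels[t]) for t in logits)
```

After this MTL stage, the shared encoder (paired with a fresh task head) is what would be fine-tuned on the finer-grained TAPE downstream tasks, such as remote homology detection.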


Figures (Figs. 1-6, via PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a23/9641823/778f3eac1e05/12859_2022_5031_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a23/9641823/52067759860c/12859_2022_5031_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a23/9641823/baf9ad9346e5/12859_2022_5031_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a23/9641823/28586728a54e/12859_2022_5031_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a23/9641823/c798609cc7e6/12859_2022_5031_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a23/9641823/b0662d02464e/12859_2022_5031_Fig6_HTML.jpg

Similar Articles

1. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics. 2022 Nov 8;23(1):467. doi: 10.1186/s12859-022-05031-z.
2. Drug knowledge discovery via multi-task learning and pre-trained models. BMC Med Inform Decis Mak. 2021 Nov 16;21(Suppl 9):251. doi: 10.1186/s12911-021-01614-7.
3. When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification. BMC Med Inform Decis Mak. 2022 Apr 5;21(Suppl 9):377. doi: 10.1186/s12911-022-01829-2.
4. A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study. J Med Internet Res. 2022 Mar 23;24(3):e27210. doi: 10.2196/27210.
5. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol. 2022;2496:221-235. doi: 10.1007/978-1-0716-2305-3_12.
6. Confounder balancing in adversarial domain adaptation for pre-trained large models fine-tuning. Neural Netw. 2024 May;173:106173. doi: 10.1016/j.neunet.2024.106173. Epub 2024 Feb 10.
7. Transformers-sklearn: a toolkit for medical language understanding with transformer-based models. BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):90. doi: 10.1186/s12911-021-01459-0.
8. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.
9. An analysis of protein language model embeddings for fold prediction. Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac142.
10. CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinformatics. 2022 Jun 13;38(12):3267-3274. doi: 10.1093/bioinformatics/btac297.

Cited By

1. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford). 2025 May 30;2025. doi: 10.1093/database/baaf027.
2. pACP-HybDeep: predicting anticancer peptides using binary tree growth based transformer and structural feature encoding with deep-hybrid learning. Sci Rep. 2025 Jan 2;15(1):565. doi: 10.1038/s41598-024-84146-0.
