SPRoBERTa: protein embedding learning with local fragment modeling.

Affiliations

Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.

National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China.

Publication information

Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac401.

DOI: 10.1093/bib/bbac401
PMID: 36136367
Abstract

A good understanding of protein function and structure in computational biology helps in the understanding of human beings. Because few proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented by individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics existing in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for deep pre-training models is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.
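
The abstract does not say which algorithm the unsupervised tokenizer uses, so the following is only a minimal sketch of the general idea of learning multi-residue fragments from unlabeled sequences. It uses a byte-pair-encoding-style merge procedure as a stand-in, which is an assumption and not necessarily the authors' method. The function names (`learn_merges`, `apply_merge`, `tokenize`) and the toy sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Fuse every non-overlapping adjacent occurrence of `pair` into one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Learn BPE-style merges from unlabeled sequences (illustrative stand-in)."""
    corpus = [list(seq) for seq in sequences]  # start from single amino acids
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically smallest pair.
        best = max(sorted(pairs), key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Segment a sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy, made-up sequences; real training would use a large unlabeled corpus.
merges = learn_merges(["MKMK", "MKVL"], num_merges=2)
tokens = tokenize("MKMKVL", merges)
print(merges)
print(tokens)
```

Compared with a fixed vocabulary of 20 single residues, a learned fragment vocabulary lets frequent local patterns become single tokens, which is the kind of local semantics the abstract refers to.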

Similar articles

1. SPRoBERTa: protein embedding learning with local fragment modeling.
   Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac401.
2. Modeling aspects of the language of life through transfer-learning protein sequences.
   BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.
3. GeneralizedDTA: combining pre-training and multi-task learning to predict drug-target binding affinity for unknown drug discovery.
   BMC Bioinformatics. 2022 Sep 7;23(1):367. doi: 10.1186/s12859-022-04905-6.
4. TLCrys: Transfer Learning Based Method for Protein Crystallization Prediction.
   Int J Mol Sci. 2022 Jan 16;23(2):972. doi: 10.3390/ijms23020972.
5. The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction.
   Evol Bioinform Online. 2021 Dec 3;17:11769343211062608. doi: 10.1177/11769343211062608. eCollection 2021.
6. BERT2DAb: a pre-trained model for antibody representation based on amino acid sequences and 2D-structure.
   MAbs. 2023 Jan-Dec;15(1):2285904. doi: 10.1080/19420862.2023.2285904. Epub 2023 Nov 27.
7. Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks.
   Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac232.
8. Incorporating homologues into sequence embeddings for protein analysis.
   J Bioinform Comput Biol. 2007 Jun;5(3):717-38. doi: 10.1142/s0219720007002734.
9. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).
   Sci Rep. 2019 Mar 5;9(1):3577. doi: 10.1038/s41598-019-38746-w.
10. A unified multitask architecture for predicting local protein properties.
    PLoS One. 2012;7(3):e32235. doi: 10.1371/journal.pone.0032235. Epub 2012 Mar 26.

Cited by

1. Advancing bioinformatics with large language models: components, applications and perspectives.
   ArXiv. 2025 Jan 31:arXiv:2401.04155v2.