The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction.

Authors

van den Bent Irene, Makrodimitris Stavros, Reinders Marcel

Affiliations

Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands.

Keygene N.V., Wageningen, the Netherlands.

Publication information

Evol Bioinform Online. 2021 Dec 3;17:11769343211062608. doi: 10.1177/11769343211062608. eCollection 2021.

DOI:10.1177/11769343211062608
PMID:34880594
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8647222/
Abstract

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
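
The cross-species transfer setup described in the abstract (fit a function predictor on the embeddings of one training species, then predict for other species) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the random vectors stand in for real protein embeddings, the per-term logistic regression is a generic classifier rather than the authors' predictor, plain accuracy is used only for brevity, and all names (`make_species`, `train_logreg`, `term_weights`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_species(rng, n, dim, shift, term_weights):
    """Synthetic stand-in for one species: (embeddings, multi-label targets).

    `shift` moves the embedding distribution to mimic a species difference;
    the embedding-to-function rule (`term_weights`) is shared across species.
    """
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(shift, 1.0) for _ in range(dim)]
        X.append(x)
        Y.append([1 if dot(w, x) > 0 else 0 for w in term_weights])
    return X, Y


def train_logreg(X, y, epochs=100, lr=0.5):
    """Binary logistic regression fitted by full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(dot(w, x) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    w, b = model
    return 1 if dot(w, x) + b > 0 else 0


def accuracy(models, X, Y):
    """Fraction of (protein, term) pairs predicted correctly."""
    correct = total = 0
    for x, y in zip(X, Y):
        for model, t in zip(models, y):
            correct += int(predict(model, x) == t)
            total += 1
    return correct / total


rng = random.Random(0)
DIM, N_TERMS = 8, 3  # toy embedding size and number of function terms
term_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TERMS)]

# "Training species": annotated proteins. "Test species": shifted distribution.
X_train, Y_train = make_species(rng, 300, DIM, 0.0, term_weights)
X_test, Y_test = make_species(rng, 200, DIM, 0.3, term_weights)

# One independent binary classifier per function term (multi-label setup).
models = [train_logreg(X_train, [y[k] for y in Y_train]) for k in range(N_TERMS)]
test_acc = accuracy(models, X_test, Y_test)
print(f"accuracy on the held-out species: {test_acc:.3f}")
```

In a real experiment the synthetic vectors would be replaced by embeddings from a pre-trained protein language model, and the labels by curated molecular function annotations for each species.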


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/1f669c193855/10.1177_11769343211062608-fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/f3faffec8e34/10.1177_11769343211062608-fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/480802b4e036/10.1177_11769343211062608-fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/cd2429946bf7/10.1177_11769343211062608-fig4.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/32d516a2d2dc/10.1177_11769343211062608-fig5.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/8a348638985c/10.1177_11769343211062608-fig6.jpg
Figure 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a6/8647222/93c8bde9172d/10.1177_11769343211062608-fig7.jpg

Similar articles

1. The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction.
   Evol Bioinform Online. 2021 Dec 3;17:11769343211062608. doi: 10.1177/11769343211062608. eCollection 2021.
2. Modeling aspects of the language of life through transfer-learning protein sequences.
   BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.
3. SPRoBERTa: protein embedding learning with local fragment modeling.
   Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac401.
4. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function.
   Bioinformatics. 2021 Apr 19;37(2):162-170. doi: 10.1093/bioinformatics/btaa701.
5. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.
   PLoS Comput Biol. 2019 Feb 26;15(2):e1006721. doi: 10.1371/journal.pcbi.1006721. eCollection 2019 Feb.
6. Combining Contextualized Embeddings and Prior Knowledge for Clinical Named Entity Recognition: Evaluation Study.
   JMIR Med Inform. 2019 Nov 13;7(4):e14850. doi: 10.2196/14850.
7. An analysis of protein language model embeddings for fold prediction.
   Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac142.
8. When Protein Structure Embedding Meets Large Language Models.
   Genes (Basel). 2023 Dec 23;15(1):25. doi: 10.3390/genes15010025.
9. Morphology-aware multi-source fusion-based intracranial aneurysms rupture prediction.
   Eur Radiol. 2022 Aug;32(8):5633-5641. doi: 10.1007/s00330-022-08608-7. Epub 2022 Feb 18.
10. Unsupervised online multitask learning of behavioral sentence embeddings.
    PeerJ Comput Sci. 2019 Jun 10;5:e200. doi: 10.7717/peerj-cs.200. eCollection 2019.

Cited by

1. Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction.
   Int J Mol Sci. 2023 Nov 6;24(21):16000. doi: 10.3390/ijms242116000.
2. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry.
   Sci Rep. 2023 Aug 16;13(1):13280. doi: 10.1038/s41598-023-40247-w.
3. SAP: Synteny-aware gene function prediction for bacteria using protein embeddings.
   bioRxiv. 2023 Nov 21:2023.05.02.539034. doi: 10.1101/2023.05.02.539034.
4. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry.
   bioRxiv. 2023 Jul 9:2023.03.20.533508. doi: 10.1101/2023.03.20.533508.
