Learning sequence, structure, and function representations of proteins with language models.

Authors

Hamamsy Tymor, Barot Meet, Morton James T, Steinegger Martin, Bonneau Richard, Cho Kyunghyun

Affiliations

Center for Data Science, New York University, New York, NY, USA.

Mythos Scientific, NJ, USA.

Publication information

bioRxiv. 2023 Nov 26:2023.11.26.568742. doi: 10.1101/2023.11.26.568742.

DOI:10.1101/2023.11.26.568742
PMID:38045331
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10690258/
Abstract

The sequence-structure-function relationships that ultimately generate the diversity of extant observed proteins are complex, as proteins bridge the gap between multiple informational and physical scales involved in nearly all cellular processes. One limitation of existing protein annotation databases such as UniProt is that less than 1% of proteins have experimentally verified functions, and computational methods are needed to fill in the missing information. Here, we demonstrate that a multi-aspect framework based on protein language models can learn sequence-structure-function representations of amino acid sequences, and can provide the foundation for sensitive sequence-structure-function aware protein sequence search and annotation. Based on this model, we introduce a multi-aspect information retrieval system for proteins, Protein-Vec, covering sequence, structure, and function aspects, that enables computational protein annotation and function prediction at tree-of-life scales.
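
To make the idea of embedding-based search and annotation concrete, here is a minimal sketch of nearest-neighbor annotation transfer over fixed-length vectors. It is not the Protein-Vec implementation: the function names, the three-dimensional toy embeddings, the protein IDs, and the annotation labels are all hypothetical, and cosine similarity is an assumed choice of similarity measure that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_annotations(query, database, k=1):
    """Return the k database entries most similar to `query`.

    `database` is a list of (protein_id, embedding, annotation) tuples.
    Each result is (protein_id, annotation, similarity), best match first.
    """
    scored = [(pid, ann, cosine(query, emb)) for pid, emb, ann in database]
    scored.sort(key=lambda hit: hit[2], reverse=True)
    return scored[:k]


# Hypothetical toy data: real embeddings would come from a trained model
# and have hundreds of dimensions.
database = [
    ("protA", [1.0, 0.0, 0.0], "kinase"),
    ("protB", [0.0, 1.0, 0.0], "protease"),
    ("protC", [0.7, 0.7, 0.0], "transferase"),
]
query = [0.9, 0.1, 0.0]

hits = nearest_annotations(query, database, k=2)
for protein_id, annotation, similarity in hits:
    print(f"{protein_id}\t{annotation}\t{similarity:.3f}")
```

In this toy setup the query lies closest to `protA`, so its annotation would be the first candidate to transfer. At the scale the abstract describes, a linear scan like this would presumably give way to an approximate nearest-neighbor index.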


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ff8/10690258/7f4cf64bd9ce/nihpp-2023.11.26.568742v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ff8/10690258/fbc90423cbb4/nihpp-2023.11.26.568742v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ff8/10690258/2e86de14328f/nihpp-2023.11.26.568742v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ff8/10690258/8f6542539a12/nihpp-2023.11.26.568742v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ff8/10690258/9d06f24caf42/nihpp-2023.11.26.568742v1-f0005.jpg
