Embeddings from deep learning transfer GO annotations beyond homology.

Affiliations

Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany.

TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.

Publication information

Sci Rep. 2021 Jan 13;11(1):1160. doi: 10.1038/s41598-020-80786-0.

DOI: 10.1038/s41598-020-80786-0
PMID: 33441905
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7806674/
Abstract

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
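
The core idea of the abstract, transferring GO terms from the annotated proteins that lie closest to a query in embedding space, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy vectors (real per-protein embeddings are far larger), the protein IDs, the use of Euclidean distance, and the distance-to-score conversion `1 / (1 + d)` are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_go_terms(
    query: Vector,
    lookup: Dict[str, Vector],
    annotations: Dict[str, Set[str]],
    k: int = 1,
) -> Dict[str, float]:
    """Copy GO terms from the k annotated proteins nearest to the query.

    Each transferred term gets a score in (0, 1] that shrinks with distance.
    The conversion 1 / (1 + d) is an illustrative assumption, not a
    published formula.
    """
    # Rank all annotated proteins by distance to the query embedding.
    ranked = sorted(
        ((euclidean(query, emb), pid) for pid, emb in lookup.items())
    )[:k]

    scores: Dict[str, float] = {}
    for distance, pid in ranked:
        score = 1.0 / (1.0 + distance)
        for term in annotations.get(pid, set()):
            # Keep the best score if several neighbours share a term.
            scores[term] = max(scores.get(term, 0.0), score)
    return scores


# Toy lookup set: hypothetical protein IDs with made-up 2-D embeddings.
lookup = {"P1": [0.0, 0.0], "P2": [1.0, 0.0], "P3": [5.0, 5.0]}
annotations = {
    "P1": {"GO:0003677"},
    "P2": {"GO:0005634"},
    "P3": {"GO:0016020"},
}
query = [0.1, 0.0]

predictions = transfer_go_terms(query, lookup, annotations, k=2)
for term, score in sorted(predictions.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.3f}")
```

With `k=1` only the terms of the single nearest protein are transferred; larger `k` trades precision for coverage. Because the lookup compares vectors rather than aligned sequences, a neighbour may be close in embedding space even when its pairwise sequence identity to the query is low, which is the situation the abstract evaluates with the < 20% identity restriction.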


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/672c/7806674/a00e6b5c1672/41598_2020_80786_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/672c/7806674/9dfb099b8c88/41598_2020_80786_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/672c/7806674/42dec2119afd/41598_2020_80786_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/672c/7806674/92c30bd34f8f/41598_2020_80786_Fig4_HTML.jpg
