TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding.

Affiliation

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.

Publication information

Bioinformatics. 2021 Sep 29;37(18):2825-2833. doi: 10.1093/bioinformatics/btab198.

DOI: 10.1093/bioinformatics/btab198
PMID: 33755048
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8479653/

Abstract

MOTIVATION

Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions.

RESULTS

To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence-Label Embedding (TALE). For generalizability to novel sequences we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability; and a GO term-centric analysis was also provided.
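The joint embedding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). It assumes the sequence has already been encoded into a fixed-length vector, for example by a transformer encoder, and that each GO term has a vector in the same latent space. The function names `joint_scores` and `combine`, the dot-product-plus-sigmoid scoring, the weight `alpha`, and the placeholder labels `GO:A` and `GO:B` are all hypothetical and chosen only for illustration. The weighted average stands in for the combination of TALE with a sequence similarity-based method; the abstract does not specify how the two are combined.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_scores(seq_embedding, label_embeddings):
    """Score every label by its similarity to the sequence embedding.

    Both embeddings live in the same latent space, so a rarely seen label
    still receives a score as long as it has an embedding.
    (Illustrative scoring rule, assumed here, not taken from the paper.)
    """
    return {
        term: sigmoid(dot(seq_embedding, vec))
        for term, vec in label_embeddings.items()
    }


def combine(model_scores, similarity_scores, alpha=0.5):
    """Weighted average of model scores and similarity-based scores.

    Labels with no similarity-based hit are treated as scoring 0.
    The weighting scheme and the value of alpha are assumptions.
    """
    return {
        term: alpha * score + (1.0 - alpha) * similarity_scores.get(term, 0.0)
        for term, score in model_scores.items()
    }


# Toy example with made-up 2-dimensional embeddings and placeholder labels.
seq = [1.0, 0.0]
labels = {"GO:A": [2.0, 0.0], "GO:B": [-2.0, 1.0]}

scores = joint_scores(seq, labels)
combined = combine(scores, {"GO:A": 1.0}, alpha=0.5)

for term in sorted(combined):
    print(term, round(scores[term], 3), round(combined[term], 3))
```

In this toy setting the label whose embedding points in the same direction as the sequence embedding receives the higher score, and a similarity-based hit for that label raises its combined score further.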

AVAILABILITY AND IMPLEMENTATION

The data, source codes and models are available at https://github.com/Shen-Lab/TALE.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


Similar articles

1. PFresGO: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships.
   Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad094.
2. Improving protein function prediction using protein sequence and GO-term similarities.
   Bioinformatics. 2019 Apr 1;35(7):1116-1124. doi: 10.1093/bioinformatics/bty751.
3. Cross-modality and self-supervised protein embedding for compound-protein affinity and contact prediction.
   Bioinformatics. 2022 Sep 16;38(Suppl_2):ii68-ii74. doi: 10.1093/bioinformatics/btac470.
4. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier.
   Bioinformatics. 2018 Feb 15;34(4):660-668. doi: 10.1093/bioinformatics/btx624.
5. TransformerGO: predicting protein-protein interactions by modelling the attention between sets of gene ontology terms.
   Bioinformatics. 2022 Apr 12;38(8):2269-2277. doi: 10.1093/bioinformatics/btac104.
6. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge.
   Bioinformatics. 2022 Sep 30;38(19):4488-4496. doi: 10.1093/bioinformatics/btac536.
7. Exploiting ontology graph for predicting sparsely annotated gene function.
   Bioinformatics. 2015 Jun 15;31(12):i357-64. doi: 10.1093/bioinformatics/btv260.
8. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank.
   Bioinformatics. 2018 Jul 15;34(14):2465-2473. doi: 10.1093/bioinformatics/bty130.
9. Hayai-Annotation Plants: an ultra-fast and comprehensive functional gene annotation system in plants.
   Bioinformatics. 2019 Nov 1;35(21):4427-4429. doi: 10.1093/bioinformatics/btz380.

Cited by

1. MKFGO: integrating multi-source knowledge fusion with pretrained language model for high-accuracy protein function prediction.
   Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf420.
2. Application of Protein Structure Encodings and Sequence Embeddings for Transporter Substrate Prediction.
   Molecules. 2025 Aug 1;30(15):3226. doi: 10.3390/molecules30153226.
3. GOAnnotator: accurate protein function annotation using automatically retrieved literature.
   Bioinformatics. 2025 Jul 1;41(Supplement_1):i410-i419. doi: 10.1093/bioinformatics/btaf199.
4. POSA-GO: Fusion of Hierarchical Gene Ontology and Protein Language Models for Protein Function Prediction.
   Int J Mol Sci. 2025 Jul 1;26(13):6362. doi: 10.3390/ijms26136362.
5. A Survey of Biological Function Prediction Methods with Focus on Natural Language Processing (NLP) and Large Language Models (LLM).
   Methods Mol Biol. 2025;2941:201-225. doi: 10.1007/978-1-0716-4623-6_13.
6. GOBoost: leveraging long-tail gene ontology terms for accurate protein function prediction.
   Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf267.
7. Multi-stage attention-based extraction and fusion of protein sequence and structural features for protein function prediction.
   Bioinformatics. 2025 Jun 26. doi: 10.1093/bioinformatics/btaf374.
8. ProtFun: A Protein Function Prediction Model Using Graph Attention Networks with a Protein Large Language Model.
   bioRxiv. 2025 May 17:2025.05.13.653854. doi: 10.1101/2025.05.13.653854.
9. GTPLM-GO: Enhancing Protein Function Prediction Through Dual-Branch Graph Transformer and Protein Language Model Fusing Sequence and Local-Global PPI Information.
   Int J Mol Sci. 2025 Apr 25;26(9):4088. doi: 10.3390/ijms26094088.
10. A multimodal model for protein function prediction.
    Sci Rep. 2025 Mar 26;15(1):10465. doi: 10.1038/s41598-025-94612-y.
