
TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding.

Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.

Publication information

Bioinformatics. 2021 Sep 29;37(18):2825-2833. doi: 10.1093/bioinformatics/btab198.

Abstract

MOTIVATION

Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods either have limited applicability because they rely on protein data besides sequences, or they lack generalizability to novel sequences, species and functions.

RESULTS

To overcome the aforementioned barriers to applicability and generalizability, we propose a novel deep learning model that uses only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence-Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE with a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available; it even outperformed a state-of-the-art method that uses network information besides sequences, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity to the training data, to new species, and to rarely annotated functions, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated the contributions of individual algorithmic components to accuracy and generalizability, and a GO term-centric analysis is also provided.
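To make the joint sequence-label embedding concrete, the sketch below is a minimal, hypothetical PyTorch rendering of the idea, not the authors' implementation (the official code is at https://github.com/Shen-Lab/TALE): a self-attention encoder maps the 1D amino-acid sequence into a latent space, each GO term receives an embedding in the same space, and a term's score is the inner product of the two. All sizes (vocab_size, n_go_terms, d_model) are illustrative placeholders.

import torch
import torch.nn as nn

class JointSeqLabelModel(nn.Module):
    """Hypothetical sketch of a joint sequence-label embedding scorer."""
    def __init__(self, vocab_size=26, n_go_terms=5000, d_model=128,
                 n_heads=4, n_layers=2, max_len=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # amino-acid tokens
        self.pos_emb = nn.Embedding(max_len, d_model)     # positions in the 1D sequence
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One latent vector per GO term, living in the same space as sequences.
        self.label_emb = nn.Embedding(n_go_terms, d_model)

    def forward(self, seq):  # seq: (batch, length) integer-encoded residues
        pos = torch.arange(seq.size(1), device=seq.device)
        h = self.encoder(self.tok_emb(seq) + self.pos_emb(pos))  # (B, L, d)
        z = h.mean(dim=1)  # pool residue features into one protein vector
        # Score every GO term by inner product in the joint latent space.
        return z @ self.label_emb.weight.T  # (B, n_go_terms) logits

TALE+ then combines the deep model's predictions with a sequence similarity-based method; a plausible reading, assumed here rather than taken from the paper, is a convex combination of the two probability vectors, followed by the usual hierarchical-consistency rule on the GO DAG (a parent term's probability is at least the maximum of its children's):

def tale_plus(model_probs, similarity_probs, alpha=0.5):
    # Blend deep-model and similarity-transferred GO probabilities;
    # alpha is a placeholder mixing weight, not the paper's exact scheme.
    return alpha * model_probs + (1.0 - alpha) * similarity_probs

def enforce_hierarchy(probs, children_of):
    # children_of: list of (parent_index, child_indices) pairs in reverse
    # topological order of the GO DAG, so child scores are final when read.
    out = probs.clone()
    for parent, kids in children_of:
        if kids:
            out[:, parent] = torch.maximum(
                out[:, parent], out[:, kids].max(dim=1).values)
    return out

For instance, torch.sigmoid(JointSeqLabelModel()(batch)) yields per-term probabilities for the multi-label prediction that these two utilities post-process.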

AVAILABILITY AND IMPLEMENTATION

The data, source codes and models are available at https://github.com/Shen-Lab/TALE.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

1. Improving protein function prediction using protein sequence and GO-term similarities.
Bioinformatics. 2019 Apr 1;35(7):1116-1124. doi: 10.1093/bioinformatics/bty751.
2. Cross-modality and self-supervised protein embedding for compound-protein affinity and contact prediction.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii68-ii74. doi: 10.1093/bioinformatics/btac470.
3. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier.
Bioinformatics. 2018 Feb 15;34(4):660-668. doi: 10.1093/bioinformatics/btx624.
4. TransformerGO: predicting protein-protein interactions by modelling the attention between sets of gene ontology terms.
Bioinformatics. 2022 Apr 12;38(8):2269-2277. doi: 10.1093/bioinformatics/btac104.
5. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge.
Bioinformatics. 2022 Sep 30;38(19):4488-4496. doi: 10.1093/bioinformatics/btac536.
6. Exploiting ontology graph for predicting sparsely annotated gene function.
Bioinformatics. 2015 Jun 15;31(12):i357-64. doi: 10.1093/bioinformatics/btv260.
7. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank.
Bioinformatics. 2018 Jul 15;34(14):2465-2473. doi: 10.1093/bioinformatics/bty130.
8. Hayai-Annotation Plants: an ultra-fast and comprehensive functional gene annotation system in plants.
Bioinformatics. 2019 Nov 1;35(21):4427-4429. doi: 10.1093/bioinformatics/btz380.

Cited by

1. GOAnnotator: accurate protein function annotation using automatically retrieved literature.
Bioinformatics. 2025 Jul 1;41(Supplement_1):i410-i419. doi: 10.1093/bioinformatics/btaf199.
2. GOBoost: leveraging long-tail gene ontology terms for accurate protein function prediction.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf267.
3. A multimodal model for protein function prediction.
Sci Rep. 2025 Mar 26;15(1):10465. doi: 10.1038/s41598-025-94612-y.
