Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
Bioinformatics. 2021 Sep 29;37(18):2825-2833. doi: 10.1093/bioinformatics/btab198.
Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods often have limited applicability because they rely on protein data beyond sequences, or they lack generalizability to novel sequences, species and functions.
To overcome these barriers in applicability and generalizability, we propose a novel deep learning model that uses only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence-Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. TALE+, which combines TALE with a sequence similarity-based method, outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method that uses network information in addition to sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins with low sequence similarity to the training data, to new species, and to rarely annotated functions, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated the contributions of algorithmic components to accuracy and generalizability, and a GO term-centric analysis is also provided.
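To make the joint sequence-label embedding idea concrete, below is a minimal PyTorch sketch of the general technique the abstract describes: a self-attention encoder produces a sequence embedding, GO-term labels get their own learned vectors in the same latent space, and each label is scored by a dot product in that space, trained as multi-label classification. The class name, dimensions, mean-pooling, and dot-product scoring are illustrative assumptions, not the authors' exact TALE architecture (and the TALE+ combination with sequence similarity is not shown); see https://github.com/Shen-Lab/TALE for the actual implementation.

```python
# Hypothetical sketch of joint sequence-label embedding for multi-label
# protein function prediction; not the authors' exact TALE code.
import torch
import torch.nn as nn

class JointSeqLabelModel(nn.Module):
    def __init__(self, vocab_size=26, n_labels=500, d_model=128,
                 n_heads=4, n_layers=2, max_len=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Each GO term gets its own vector in the shared latent space,
        # so rarely seen (tail) labels can still borrow structure from it.
        self.label_emb = nn.Embedding(n_labels, d_model)

    def forward(self, seq_tokens):
        # seq_tokens: (batch, length) integer-encoded amino acids, 0 = pad.
        pos = torch.arange(seq_tokens.size(1), device=seq_tokens.device)
        h = self.tok_emb(seq_tokens) + self.pos_emb(pos)
        pad_mask = seq_tokens.eq(0)
        h = self.encoder(h, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padded positions to get one sequence vector.
        keep = (~pad_mask).unsqueeze(-1).float()
        z = (h * keep).sum(1) / keep.sum(1).clamp(min=1.0)
        # Joint-embedding score: dot product of the sequence vector with
        # every label vector, yielding one logit per GO term.
        return z @ self.label_emb.weight.t()        # (batch, n_labels)

# Toy usage: one multi-label training step with binary cross-entropy.
model = JointSeqLabelModel()
seqs = torch.randint(1, 26, (4, 200))               # 4 random "proteins"
targets = torch.randint(0, 2, (4, 500)).float()     # random GO annotations
loss = nn.BCEWithLogitsLoss()(model(seqs), targets)
loss.backward()
```

The key design point this sketch illustrates is that label embeddings live in the same space as sequence embeddings, so unseen or tail GO terms are scored by geometric proximity rather than by a per-label classifier that would require abundant training examples.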
The data, source code and models are available at https://github.com/Shen-Lab/TALE.
Supplementary data are available at Bioinformatics online.