Univ Rennes, Inria, CNRS, IRISA-UMR 6074, Rennes 35000, France.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad620.
There is a growing number of available protein sequences, but only a limited fraction of them have been manually annotated. For example, only 0.25% of all UniProtKB entries have been reviewed by human annotators. Further developing automatic tools that infer protein function from sequence alone can help close part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes.
We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specializing a protein language model, outperform state-of-the-art tools for sequence-based monofunctional enzyme class prediction. Accuracy improves from 84% to 95% for the prediction of EC numbers at level two on the EC40 benchmark. To evaluate prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score improves from 41% to 54% and from 20% to 26%, respectively. Finally, we show that a simple combination of attention maps is on par with, or better than, classical interpretability methods on the EC prediction task. More specifically, important residues identified by the attention maps tend to correspond to known catalytic sites. Quantitatively, we report a maximum F-Gain score of 96.05%, whereas classical interpretability methods reach at best 91.44%.
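The evaluation above relies on two notions that can be made concrete: truncating a four-field EC number to a given hierarchy level (e.g. level two of EC 1.1.1.1 is "1.1"), and the macro-F1 score, which averages per-class F1 with equal weight per class. The following stdlib-only sketch illustrates both on hypothetical toy labels; it is not the paper's evaluation code.

```python
# Illustrative sketch (not from the paper): EC-number truncation and macro-F1.
from collections import defaultdict

def truncate_ec(ec: str, level: int) -> str:
    """Keep the first `level` fields of an EC number, e.g. "1.1.1.1" -> "1.1"."""
    return ".".join(ec.split(".")[:level])

def macro_f1(y_true, y_pred) -> float:
    """Macro-averaged F1: per-class F1, averaged with equal class weight."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical level-4 predictions, scored at level two:
true_ec = ["1.1.1.1", "2.7.11.1", "3.4.21.4"]
pred_ec = ["1.1.1.2", "2.7.10.2", "3.1.1.1"]
score = macro_f1([truncate_ec(e, 2) for e in true_ec],
                 [truncate_ec(e, 2) for e in pred_ec])
print(score)  # -> 0.5 (classes 1.1 and 2.7 match; 3.1 and 3.4 do not)
```

Note that macro averaging counts each class equally, so rare EC classes weigh as much as common ones, which is why the paper reports it for the fine-grained level-four benchmarks.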
Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910.