
Predicting enzymatic function of protein sequences with attention.

Affiliations

Univ Rennes, Inria, CNRS, IRISA-UMR 6074, Rennes 35000, France.

Publication details

Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad620.

Abstract

MOTIVATION

The number of available protein sequences is growing rapidly, but only a small fraction of them has been manually annotated. For example, only 0.25% of all UniProtKB entries have been reviewed by human annotators. Further developing automatic tools that infer protein function from sequence alone can help close part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes.
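The prediction target here is the hierarchical Enzyme Commission (EC) number (e.g. 1.1.1.1), which the paper evaluates at level two and at level four. As a minimal illustration (not code from the paper), truncating an EC number to a given level can be done as follows; the function name is a hypothetical helper:

```python
def ec_at_level(ec: str, level: int) -> str:
    """Truncate an EC number such as '3.4.21.5' to its first `level` components (1-4)."""
    parts = ec.split(".")
    return ".".join(parts[:level])

print(ec_at_level("1.1.1.1", 2))  # -> "1.1"
```

Level-two prediction thus distinguishes subclasses such as 1.1 (oxidoreductases acting on CH-OH groups), while level four identifies the exact enzyme.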

RESULTS

We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we also show that a simple combination of attention maps is on par with, or better than, other classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a max F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best.
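The "simple combination of attention maps" can be pictured as averaging a transformer's per-layer, per-head attention tensors and scoring each residue by the total attention it receives. The toy sketch below (random data, not the paper's implementation or exact combination rule) shows the shape of such a computation:

```python
import numpy as np

# Toy attention tensor: (layers, heads, query_position, key_position)
rng = np.random.default_rng(0)
seq_len = 8
attn = rng.random((4, 2, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)   # each row is an attention distribution

combined = attn.mean(axis=(0, 1))          # average over layers and heads
importance = combined.sum(axis=0)          # attention received per residue

# Highest-scoring residues are candidate functionally important positions,
# to be compared against annotated catalytic sites.
top = np.argsort(importance)[::-1][:3]
print(top)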

AVAILABILITY AND IMPLEMENTATION

Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910.

