Univ Rennes, Inria, CNRS, IRISA-UMR 6074, Rennes 35000, France.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad620.
There is a growing number of available protein sequences, but only a limited fraction of them have been manually annotated. For example, only 0.25% of all UniProtKB entries have been reviewed by human annotators. Further developing automatic tools that infer protein function from sequence alone can help close part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes.
We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specializing a protein language model, outperform state-of-the-art tools for sequence-based monofunctional enzyme class prediction. Accuracy improves from 84% to 95% for the prediction of EC numbers at level two on the EC40 benchmark. To evaluate prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score improves from 41% to 54% and from 20% to 26%, respectively. Finally, we show that a simple combination of attention maps is on par with, or better than, classical interpretability methods on the EC prediction task. More specifically, important residues identified by the attention maps tend to correspond to known catalytic sites. Quantitatively, we report a maximum F-Gain score of 96.05%, whereas classical interpretability methods reach at best 91.44%.
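The evaluation above relies on two notions that can be made concrete: truncating a four-field EC number to a given hierarchy level (e.g. level two of EC 1.1.1.1 is "1.1"), and the macro-F1 score, which averages per-class F1 with equal weight per class. The following stdlib-only sketch illustrates both on hypothetical toy labels; it is not the paper's evaluation code.

```python
# Illustrative sketch (not from the paper): EC-number truncation and macro-F1.
from collections import defaultdict

def truncate_ec(ec: str, level: int) -> str:
    """Keep the first `level` fields of an EC number, e.g. "1.1.1.1" -> "1.1"."""
    return ".".join(ec.split(".")[:level])

def macro_f1(y_true, y_pred) -> float:
    """Macro-averaged F1: per-class F1, averaged with equal class weight."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical level-4 predictions, scored at level two:
true_ec = ["1.1.1.1", "2.7.11.1", "3.4.21.4"]
pred_ec = ["1.1.1.2", "2.7.10.2", "3.1.1.1"]
score = macro_f1([truncate_ec(e, 2) for e in true_ec],
                 [truncate_ec(e, 2) for e in pred_ec])
print(score)  # -> 0.5 (classes 1.1 and 2.7 match; 3.1 and 3.4 do not)
```

Note that macro averaging counts each class equally, so rare EC classes weigh as much as common ones, which is why the paper reports it for the fine-grained level-four benchmarks.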
Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910.