Capela João, Zimmermann-Kogadeeva Maria, van Dijk Aalt D J, de Ridder Dick, Dias Oscar, Rocha Miguel
Centre of Biological Engineering, University of Minho, Braga, 4710-057, Portugal.
Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
BMC Bioinformatics. 2025 Feb 27;26(1):68. doi: 10.1186/s12859-025-06081-9.
Protein large language models (LLMs) have been used to extract representations of enzyme sequences to predict their function, which is encoded by Enzyme Commission (EC) numbers. However, a comprehensive comparison of different LLMs for this task is still lacking, leaving open questions about their relative performance. Moreover, protein sequence alignment tools (e.g. BLASTp or DIAMOND) are often combined with machine learning models to assign EC numbers from homologous enzymes, thus compensating for the shortcomings of these models' predictions. In this context, LLMs and sequence alignment methods have not been extensively compared as individual predictors, leaving unaddressed questions about LLMs' performance and limitations relative to alignment methods. In this study, we assessed the ability of the ESM2, ESM1b, and ProtBERT language models to predict EC numbers, comparing them with BLASTp, with each other, and with models that rely on one-hot encodings of amino acid sequences.
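As a minimal illustration of the two sequence representations being compared, the sketch below extracts a mean-pooled ESM2 embedding with the Hugging Face transformers library and builds a one-hot encoding of the same sequence. The checkpoint name, the mean pooling, and the example sequence are illustrative assumptions, not necessarily the exact configuration used by the authors.

```python
# Sketch: two ways of representing an enzyme sequence for EC-number prediction.
# The esm2_t33_650M_UR50D checkpoint and mean pooling are illustrative choices.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

# 1) Protein-LLM embedding: mean-pool the per-residue hidden states of ESM2.
checkpoint = "facebook/esm2_t33_650M_UR50D"  # one of several ESM2 sizes; assumption
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state          # (1, length, 1280)
    embedding = hidden.mean(dim=1).squeeze(0).numpy()   # fixed-size sequence vector

# 2) One-hot baseline: a 20-dimensional indicator vector per residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
one_hot = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
for i, aa in enumerate(sequence):
    if aa in AMINO_ACIDS:
        one_hot[i, AMINO_ACIDS.index(aa)] = 1.0

print(embedding.shape, one_hot.shape)
```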
Our findings reveal that combining these LLMs with fully connected neural networks surpasses the performance of deep learning (DL) models that rely on one-hot encodings. Moreover, although BLASTp provided marginally better results overall, the DL models produced results that complement BLASTp's, revealing that LLMs better predict certain EC numbers while BLASTp excels at predicting others. ESM2 stood out as the best of the LLMs tested, providing more accurate predictions on difficult annotation tasks and for enzymes without homologs.
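A minimal sketch of the kind of fully connected classifier described here: a multi-label network mapping a fixed-size LLM embedding to EC classes, trained with a binary cross-entropy objective. The hidden-layer sizes, dropout, number of EC classes, and loss choice are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch: fully connected head on top of LLM embeddings for multi-label EC prediction.
# Hidden sizes, dropout, and the number of EC classes are illustrative assumptions.
import torch
import torch.nn as nn

class ECClassifier(nn.Module):
    def __init__(self, embedding_dim: int = 1280, num_ec_classes: int = 5000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, num_ec_classes),  # one logit per EC number
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)

model = ECClassifier()
criterion = nn.BCEWithLogitsLoss()  # multi-label: an enzyme may carry several EC numbers

# Toy training step on random data, only to show the shapes involved.
embeddings = torch.randn(8, 1280)   # batch of LLM sequence embeddings
targets = torch.zeros(8, 5000)
targets[:, 42] = 1.0                # pretend every sequence belongs to EC class index 42
loss = criterion(model(embeddings), targets)
loss.backward()
print(float(loss))
```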
Crucially, this study demonstrates that LLMs must still improve before they can displace BLASTp as the gold-standard tool in mainstream enzyme annotation routines. On the other hand, LLMs can provide good predictions for harder-to-annotate enzymes, particularly when the sequence identity between the query and the reference database falls below 25%. Our results reinforce the claim that BLASTp and LLM-based models complement each other and can be more effective when used together.
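For context on the alignment baseline, the sketch below shows the homology-transfer procedure BLASTp is typically used for: align queries against an annotated reference database and transfer the EC number of the best hit, optionally inspecting the percent identity (the 25% level discussed above is one natural cut-off to examine). The file names, the accession-to-EC mapping, and the thresholds are assumptions for illustration, not the study's exact pipeline.

```python
# Sketch: EC-number transfer from the best BLASTp hit.
# Assumes a pre-built protein database (makeblastdb) and a dictionary mapping
# reference accessions to EC numbers; file names and thresholds are illustrative.
import subprocess

QUERY = "query.fasta"
DB = "swissprot_enzymes"            # hypothetical reference database name
REF_EC = {"P00558": "2.7.2.3"}      # hypothetical accession -> EC mapping

result = subprocess.run(
    ["blastp", "-query", QUERY, "-db", DB,
     "-outfmt", "6 qseqid sseqid pident evalue bitscore",
     "-max_target_seqs", "5", "-evalue", "1e-5"],
    capture_output=True, text=True, check=True,
)

predictions = {}
for line in result.stdout.splitlines():
    qseqid, sseqid, pident, evalue, bitscore = line.split("\t")
    # Transfer the EC number of the first (best-scoring) annotated hit per query,
    # here requiring at least 25% identity as an illustrative filter.
    if qseqid not in predictions and float(pident) >= 25.0 and sseqid in REF_EC:
        predictions[qseqid] = REF_EC[sseqid]

print(predictions)
```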