Zhao Wendi, Han Qiaoling, Yang Fan, Zhao Yue
School of Technology, Beijing Forestry University, Beijing, China.
Key Lab of State Forestry Administration for Forestry Equipment and Automation, Beijing, China.
Proteins. 2025 Sep;93(9):1507-1517. doi: 10.1002/prot.26822. Epub 2025 Apr 2.
Accurate prediction of enzyme function is crucial for elucidating disease mechanisms and identifying drug targets. However, existing Enzyme Commission (EC) number prediction methods are limited by database coverage and by how deeply they mine sequence information, which hinders both the efficiency and the precision of enzyme function annotation. This study therefore introduces ProteEC-CLA (Protein EC number prediction model with Contrastive Learning and Agent Attention). ProteEC-CLA uses contrastive learning to construct positive and negative sample pairs, which strengthens sequence feature extraction and improves the use of unlabeled data. This process helps the model learn how sequence features differ, and so improves its ability to predict enzyme function. The model integrates the pre-trained protein language model ESM2 to generate informative sequence embeddings for deep functional correlation analysis, significantly improving prediction accuracy. An Agent Attention mechanism further strengthens ProteEC-CLA's ability to capture both local details and global features, supporting high-accuracy predictions on complex sequences. ProteEC-CLA performs well on two independent and representative datasets. On the standard dataset, it achieves 98.92% accuracy at the EC4 level. On the more challenging clustered split dataset, it achieves 93.34% accuracy and an F1-score of 94.72%. With only enzyme sequences as input, ProteEC-CLA can accurately predict EC numbers up to the fourth level, making it an efficient and precise functional annotation tool for enzymology research and applications.
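The abstract does not specify the loss function or how sample pairs are built, so the following is only a minimal sketch of the general idea, assuming a generic supervised contrastive (InfoNCE-style) objective in which two sequences form a positive pair when their EC numbers agree up to a chosen level. The function names (`ec_prefix`, `supervised_contrastive_loss`), the temperature value, and the toy 2-D vectors are all hypothetical; in practice the vectors would be sequence embeddings such as those produced by ESM2.

```python
import math


def ec_prefix(ec_number, level):
    """Truncate an EC number such as '2.7.1.1' to its first `level` fields."""
    return ".".join(ec_number.split(".")[:level])


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, ec_numbers, level=4, temperature=0.1):
    """Generic supervised contrastive loss (an assumption, not the paper's exact loss).

    Positives for an anchor are the other samples sharing its EC prefix at
    `level`; every other sample in the batch acts as a negative.
    """
    labels = [ec_prefix(ec, level) for ec in ec_numbers]
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Temperature-scaled similarities to every other sample in the batch.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings: the first two vectors are close, as are the last two.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matching_ecs = ["1.1.1.1", "1.1.1.1", "2.7.1.1", "2.7.1.1"]
mismatched_ecs = ["1.1.1.1", "2.7.1.1", "1.1.1.1", "2.7.1.1"]

loss_matching = supervised_contrastive_loss(toy_embeddings, matching_ecs)
loss_mismatched = supervised_contrastive_loss(toy_embeddings, mismatched_ecs)
```

When nearby embeddings share an EC number, the loss is small; when the labels disagree with the geometry, it is large. Minimizing such an objective pulls same-function sequences together and pushes different-function sequences apart. Lowering `level` makes sequences that agree only on coarser EC fields count as positives.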