Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China.
Nat Commun. 2024 Aug 27;15(1):7348. doi: 10.1038/s41467-024-51511-6.
Annotating active sites in enzymes is crucial for advancing multiple fields, including drug discovery, disease research, enzyme engineering, and synthetic biology. Despite the development of numerous automated annotation algorithms, a significant trade-off between speed and accuracy limits their large-scale practical application. We introduce EasIFA, an enzyme active site annotation algorithm that fuses latent enzyme representations from a protein language model and a 3D structural encoder, and then aligns protein-level information with knowledge of enzymatic reactions using a multi-modal cross-attention framework. EasIFA outperforms BLASTp, running 10 times faster while improving recall, precision, F1 score, and MCC by 7.57%, 13.08%, 9.68%, and 0.1012, respectively. It also surpasses an empirical-rule-based algorithm and another state-of-the-art deep learning annotation method based on PSSM features, achieving a 650- to 1400-fold speed increase while enhancing annotation quality. This makes EasIFA a suitable replacement for conventional tools in both industrial and academic settings. EasIFA can also effectively transfer knowledge gained from coarsely annotated enzyme databases to smaller, high-precision datasets, highlighting its ability to model sparse, high-quality databases. Additionally, EasIFA shows potential as a catalytic site monitoring tool for designing enzymes with desired functions beyond their natural distribution.
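The quality metrics quoted above (recall, precision, F1 score, and MCC) are standard measures for binary, per-residue predictions. The following is a minimal sketch of how they could be computed for a single enzyme, assuming each residue is labeled as either an active site or not. The function name `residue_metrics`, the toy residue indices, and the sequence length are illustrative assumptions; the abstract does not specify the evaluation protocol (for example, how scores are averaged across enzymes), so this should not be read as the authors' implementation.

```python
import math


def residue_metrics(true_sites, predicted_sites, sequence_length):
    """Per-residue precision, recall, F1 and MCC for one enzyme.

    true_sites / predicted_sites: iterables of residue indices labeled
    as active sites. sequence_length: total number of residues.
    """
    true_sites = set(true_sites)
    predicted_sites = set(predicted_sites)

    tp = len(true_sites & predicted_sites)   # correctly predicted sites
    fp = len(predicted_sites - true_sites)   # predicted, but not annotated
    fn = len(true_sites - predicted_sites)   # annotated, but missed
    tn = sequence_length - tp - fp - fn      # correctly ignored residues

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical toy case: 3 annotated sites, 3 predicted sites, 2 in common.
metrics = residue_metrics(
    true_sites={10, 57, 102},
    predicted_sites={10, 57, 200},
    sequence_length=300,
)
print(metrics)
```

Because active-site residues are a small fraction of any sequence, MCC is informative here: unlike accuracy, it is not inflated by the large number of true negatives. Note also that MCC ranges from -1 to 1, which is presumably why the abstract reports its improvement as an absolute difference (0.1012) rather than as a percentage.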