Zeng Zishuo, Guo Jin, Jin Jiao, Luo Xiaozhou
Synceres Biosciences Co. Ltd., Shenzhen, 518100, China.
Shenzhen Key Laboratory for the Intelligent Microbial Manufacturing of Medicines, Key Laboratory of Quantitative Synthetic Biology, Center for Synthetic Biochemistry, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China.
J Cheminform. 2025 Jan 7;17(1):2. doi: 10.1186/s13321-024-00944-8.
Predicting Enzyme Commission (EC) numbers for chemical reactions enables efficient enzymatic annotation for computer-aided synthesis planning. However, conventional machine learning approaches are hampered by data scarcity and class imbalance. Here, we introduce CLAIRE (Contrastive Learning-based AnnotatIon for Reaction's EC), a novel framework that combines contrastive learning, pre-trained language model-based reaction embeddings, and data augmentation to address these limitations. CLAIRE achieved weighted average F1 scores of 0.861 on the testing set (n = 18,816) and 0.911 on an independent dataset (n = 1040) derived from a yeast metabolic model. CLAIRE significantly outperformed the state-of-the-art model by 3.65-fold and 1.18-fold on these two datasets, respectively. Its high accuracy positions CLAIRE as a promising tool for retrosynthesis planning, drug fate prediction, and synthetic biology applications. CLAIRE is freely available on GitHub (https://github.com/zishuozeng/CLAIRE).
Scientific contribution: This work employs contrastive learning to predict the EC numbers of enzymatic reactions, overcoming the challenges of data scarcity and class imbalance. The new model achieves state-of-the-art performance and may facilitate computer-aided synthesis planning.
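The reported metric, the weighted average F1 score, averages the per-class F1 scores with each class weighted by its number of true examples, which is why it is commonly used when classes are imbalanced. The following is a minimal sketch of that standard definition, not code from the CLAIRE repository; the function name `weighted_f1` and the toy EC labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Hypothetical helper for illustration; follows the standard
    definition of the weighted average F1 metric.
    """
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted this label and it was correct
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class by its share of the true labels
        score += (n_true / total) * f1
    return score


# Toy labels invented for this example (not data from the paper)
y_true = ["1.1.1.1", "1.1.1.1", "1.1.1.1", "2.7.1.1"]
y_pred = ["1.1.1.1", "1.1.1.1", "2.7.1.1", "2.7.1.1"]
score = weighted_f1(y_true, y_pred)
```

Because each class contributes in proportion to its support, frequent EC classes dominate the score; a macro average, which weights all classes equally, would be more sensitive to rare classes.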