Shi Zhenkun, Deng Rui, Yuan Qianqian, Mao Zhitao, Wang Ruoyu, Li Haoran, Liao Xiaoping, Ma Hongwu
Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China.
National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China.
Research (Wash D C). 2023 May 31;6:0153. doi: 10.34133/research.0153. eCollection 2023.
Enzyme Commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for accurately understanding enzyme function and cellular metabolism. Many ab initio computational approaches have been proposed to predict EC numbers from input protein sequences. However, the prediction performance (accuracy, recall, and precision), usability, and efficiency of existing methods degrade seriously on recently discovered proteins, leaving much room for improvement. Here, we report HDMLF, a hierarchical dual-core multitask learning framework that uses novel deep learning techniques to predict EC numbers accurately. HDMLF is composed of an embedding core and a learning core: the embedding core adopts the latest protein language model for protein sequence embedding, and the learning core conducts the EC number prediction. Specifically, HDMLF is built on a gated recurrent unit framework and performs EC number prediction in a hierarchical, multi-objective, multitask manner. Additionally, we introduced an attention layer to optimize the EC prediction and employed a greedy strategy to integrate and fine-tune the final model. Comparative analyses against 4 representative methods demonstrate that HDMLF consistently delivers the highest performance, improving accuracy and F1 score by 60% and 40%, respectively, over the state of the art. An additional case study, in which tyrB is predicted to compensate for the loss of the aspartate aminotransferase aspC, as reported in a previous experimental study, shows that our model can also be used to uncover enzyme promiscuity. Finally, to improve usability, we established a web platform, ECRECer (https://ecrecer.biodesign.ac.cn), built on an entirely cloud-based serverless architecture, and provided an offline bundle.
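To make the hierarchical, multitask framing concrete, the sketch below shows one way training targets could be derived from a protein's EC annotations. It relies only on the fact that an EC number has four dot-separated levels (class, subclass, sub-subclass, serial number). The task split (enzyme or not, number of functions, per-level labels) and the names `ec_prefixes`, `build_targets`, and `Targets` are illustrative assumptions, not the HDMLF implementation.

```python
from typing import List, NamedTuple, Set


class Targets(NamedTuple):
    # Hypothetical container for three illustrative prediction tasks.
    is_enzyme: bool                # task 1: does the protein have any EC number?
    n_functions: int               # task 2: how many EC numbers are annotated?
    level_labels: List[Set[str]]   # task 3: EC prefixes at each of the 4 levels


def ec_prefixes(ec: str) -> List[str]:
    """Return the cumulative prefixes of an EC number, one per hierarchy level."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a four-level EC number, got {ec!r}")
    return [".".join(parts[:i]) for i in range(1, 5)]


def build_targets(ec_numbers: List[str]) -> Targets:
    """Derive per-task targets from the EC numbers annotated for one protein."""
    level_labels: List[Set[str]] = [set() for _ in range(4)]
    for ec in ec_numbers:
        for level, prefix in enumerate(ec_prefixes(ec)):
            level_labels[level].add(prefix)
    return Targets(bool(ec_numbers), len(ec_numbers), level_labels)


if __name__ == "__main__":
    # Illustrative input: two transaminase EC numbers that share the
    # first three levels (2.6.1) and differ only in the serial number.
    targets = build_targets(["2.6.1.1", "2.6.1.57"])
    for level, labels in enumerate(targets.level_labels, start=1):
        print(level, sorted(labels))
```

Because promiscuous or multifunctional enzymes carry several EC numbers, each level is kept as a set of labels here, so a shared prefix such as 2.6.1 appears once while the differing serial numbers are both retained.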