Wang Han, Ren Zilin, Sun Jinghong, Chen Yongbing, Bo Xiaochen, Xue JiGuo, Gao Jingyang, Ni Ming
College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China.
Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae579.
Deriving protein function from protein sequences poses a significant challenge due to the intricate relationship between sequence and function. Deep learning has made remarkable strides in predicting sequence-function relationships. However, models tailored for specific tasks or protein types encounter difficulties when using transfer learning across domains, because protein function depends heavily on structural characteristics rather than sequence information alone. Consequently, there is a pressing need for a model capable of capturing shared features among diverse sequence-function mapping tasks to address this generalization issue. In this study, we explore the potential of Model-Agnostic Meta-Learning (MAML) combined with a protein language model, Evolutionary Scale Modeling (ESM), to tackle this challenge. Our approach trains the architecture on five out-of-domain deep mutational scanning (DMS) datasets and evaluates its performance across four key dimensions. Our findings demonstrate that the proposed architecture generalizes well and supports an effective few-shot learning strategy: compared to the best results, the Pearson correlation coefficient (PCC) in the final stage increased by ~0.31%. Furthermore, we leverage the trained architecture to predict binding affinity scores on the SARS-CoV-2 DMS dataset using transfer learning. Notably, training on a 500-sample subset of the Ube4b dataset improved the PCC by 0.11. These results underscore the potential of our conceptual architecture as a promising methodology for multi-task protein function prediction.
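The abstract describes pairing a protein language model (ESM) as a sequence encoder with MAML trained across multiple DMS tasks. The sketch below is an illustrative, hedged outline of such a setup, assuming precomputed (frozen) ESM embeddings as inputs and a first-order MAML approximation; the names RegressionHead, inner_adapt, meta_train, and tasks are hypothetical placeholders and do not reflect the authors' implementation.

```python
# Illustrative sketch only: first-order MAML over a small regression head on top
# of frozen protein embeddings (e.g. ESM representations). Not the authors' code.
import copy
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Maps a fixed-size sequence embedding to a scalar fitness/affinity score."""
    def __init__(self, embed_dim=1280, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

def inner_adapt(model, x_support, y_support, lr=1e-2, steps=5):
    """Clone the meta-model and take a few gradient steps on one task's support set."""
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(task_model(x_support), y_support).backward()
        opt.step()
    return task_model

def meta_train(model, tasks, meta_lr=1e-3, epochs=100):
    """First-order MAML: query-set gradients of each adapted task model
    are accumulated and applied to the shared initialization."""
    meta_opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        meta_opt.zero_grad()
        for x_s, y_s, x_q, y_q in tasks:          # one DMS dataset per task
            adapted = inner_adapt(model, x_s, y_s)
            q_loss = loss_fn(adapted(x_q), y_q)
            grads = torch.autograd.grad(q_loss, list(adapted.parameters()))
            # First-order approximation: apply query gradients to meta-parameters.
            for p, g in zip(model.parameters(), grads):
                p.grad = g.clone() if p.grad is None else p.grad + g
        meta_opt.step()
    return model

# Toy usage with random tensors standing in for ESM sequence embeddings.
torch.manual_seed(0)
tasks = [(torch.randn(16, 1280), torch.randn(16),
          torch.randn(16, 1280), torch.randn(16)) for _ in range(5)]
meta_model = meta_train(RegressionHead(), tasks, epochs=10)
```

After meta-training, the shared initialization could be fine-tuned on a small support set from a new assay (e.g., a SARS-CoV-2 binding-affinity DMS) and evaluated by PCC on held-out variants, mirroring the few-shot transfer setting described above.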