School of Life Sciences, Northeast Agricultural University, Harbin, 150030, China.
State Key Laboratory of Membrane Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, 100101, China.
BMC Bioinformatics. 2022 Nov 8;23(1):467. doi: 10.1186/s12859-022-05031-z.
Natural language processing models pre-trained on a large natural language corpus can transfer learned knowledge to the protein domain by fine-tuning on specific in-domain tasks. However, few studies have focused on enriching such protein language models by jointly learning protein properties from strongly correlated protein tasks. Here we designed a multi-task learning (MTL) architecture aimed at deciphering implicit structural and evolutionary information from three sequence-level classification tasks: protein family, superfamily, and fold. Given the contextual relevance shared between human language and protein sequences, we employed BERT, pre-trained on a large natural language corpus, as our backbone for handling protein sequences. More importantly, the knowledge encoded in the MTL stage transfers well to the more fine-grained downstream tasks of the TAPE benchmark. Experiments on structure- and evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
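To make the MTL setup concrete, the sketch below shows one way a shared BERT encoder could feed three sequence-level classification heads (family, superfamily, fold) whose losses are summed. This assumes a PyTorch/HuggingFace stack; the class, head names, and backbone checkpoint are illustrative assumptions, not the authors' released code.

```python
# Minimal multi-task sketch: shared BERT backbone, three classification heads.
# All names here are hypothetical; the paper's exact architecture may differ.
import torch
import torch.nn as nn
from transformers import BertModel


class MultiTaskProteinBert(nn.Module):
    """BERT backbone shared by family/superfamily/fold classification heads."""

    def __init__(self, n_family, n_superfamily, n_fold,
                 backbone="bert-base-uncased"):
        super().__init__()
        # Backbone pre-trained on a natural language corpus, as in the abstract.
        self.bert = BertModel.from_pretrained(backbone)
        hidden = self.bert.config.hidden_size
        # One linear classifier per task, all reading the pooled [CLS] state.
        self.heads = nn.ModuleDict({
            "family": nn.Linear(hidden, n_family),
            "superfamily": nn.Linear(hidden, n_superfamily),
            "fold": nn.Linear(hidden, n_fold),
        })

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return {task: head(pooled) for task, head in self.heads.items()}


def multitask_loss(logits, labels):
    """Joint objective: sum of per-task cross-entropy losses."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[t], labels[t]) for t in logits)
```

After joint training, the shared encoder (without the task heads) would be the natural component to fine-tune on downstream TAPE tasks.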