Informatics Institute, Vrije Universiteit, 1081 HV, Amsterdam, The Netherlands.
Sci Rep. 2022 Sep 26;12(1):16047. doi: 10.1038/s41598-022-19608-4.
Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated on only one or two downstream tasks, making it unclear whether they capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models, with hyperparameters, trained specifically for these benchmarks. Pre-training was performed on two tasks: masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks, such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets are available at https://github.com/ibivu/protein-glue .
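The masked symbol prediction objective mentioned in the abstract can be illustrated with a minimal sketch: a fraction of residues in a protein sequence is replaced by a mask token, and the model is trained to recover the original amino acids at those positions. The function and token names below are illustrative assumptions, not taken from the ProteinGLUE codebase.

```python
import random

# The 20 standard amino acid one-letter codes (illustrative alphabet).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"  # hypothetical mask token

def mask_sequence(seq, mask_frac=0.15, rng=None):
    """Randomly replace a fraction of residues with a mask token.

    Returns the masked sequence (as a list of tokens) and the
    (position, original symbol) pairs the model must predict --
    the targets of the masked symbol prediction objective.
    """
    rng = rng or random.Random(0)
    tokens = list(seq)
    n_mask = max(1, round(mask_frac * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = []
    for p in positions:
        targets.append((p, tokens[p]))
        tokens[p] = MASK
    return tokens, targets

masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
```

During pre-training, the loss is computed only at the masked positions, so the model must use the surrounding sequence context to infer the hidden residues.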