Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China.
Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China.
Bioinformatics. 2023 Dec 1;39(12). doi: 10.1093/bioinformatics/btad724.
The biological functions of proteins are determined by the chemical and geometric properties of their surfaces. With the rapid progress of deep learning, a series of learning-based surface descriptors have been proposed and have achieved encouraging performance in tasks such as protein design and protein-protein interaction prediction. However, these descriptors remain limited by label scarcity, since labels are typically obtained through wet-lab experiments. Inspired by the success of self-supervised learning in natural language processing and computer vision, we introduce ProteinMAE, a self-supervised framework designed specifically for protein surface representation to mitigate label scarcity. Specifically, we propose an efficient network and pretrain it with self-supervised learning on a large amount of accessible unlabeled protein data. We then use the pretrained weights as initialization and fine-tune the network on downstream tasks. To demonstrate the effectiveness of our method, we conduct experiments on three downstream tasks: binding site identification on protein surfaces, ligand-binding protein pocket classification, and protein-protein interaction prediction. The experiments show that our method not only improves the network's performance on all downstream tasks but also achieves performance competitive with state-of-the-art methods. Moreover, the proposed network has a significant advantage in computational cost, requiring less than a tenth of the memory of previous methods.
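The abstract does not spell out the pretraining objective, so the following is only a minimal sketch of the general masked-autoencoding idea that the name ProteinMAE suggests, not the authors' implementation. It assumes the surface has already been split into fixed-size point patches; the patch count, patch size, mask ratio, mean-squared-error loss, and the helper names `random_patch_mask` and `masked_reconstruction_loss` are all hypothetical choices made for illustration.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array; True marks patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error computed over the masked patches only."""
    diff = prediction[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)

# Assumed toy input: 64 surface patches, 32 points each, xyz coordinates.
patches = rng.normal(size=(64, 32, 3))

# Hide half of the patches (the ratio is an assumption, not from the paper).
mask = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)
visible = patches[~mask]  # the only patches an encoder would see

# Placeholder for a decoder's output; a real model would predict this.
prediction = np.zeros_like(patches)
loss = masked_reconstruction_loss(patches, prediction, mask)
```

In a pipeline of this kind, no labels are needed because the loss compares the reconstruction against the input itself; the encoder weights learned this way would then serve as the initialization for fine-tuning on a labeled downstream task.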