School of Computer Science, McGill University, 845 Rue Sherbrooke O, Montreal, Quebec H3A 0G4, Canada.
MILA-Quebec AI Institute, 6666 Rue Saint-Urbain, Montreal, Quebec H2S 3H1, Canada.
Bioinformatics. 2023 Apr 3;39(4). doi: 10.1093/bioinformatics/btad189.
Protein representation learning methods have shown great potential for many downstream tasks in biological applications. A few recent studies have demonstrated that self-supervised learning is a promising solution to the scarcity of labeled proteins, which is a major obstacle to effective protein representation learning. However, existing protein representation learning models are usually pretrained on protein sequences alone, without considering the important structural information of proteins.
In this work, we propose a novel structure-aware protein self-supervised learning method to effectively capture the structural information of proteins. In particular, a graph neural network model is pretrained to preserve protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage an available protein language model, pretrained on protein sequences, to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed graph neural network model via a novel pseudo bi-level optimization scheme. We conduct experiments on three downstream tasks: binary classification into membrane/non-membrane proteins, location classification into 10 cellular compartments, and enzyme-catalyzed reaction classification into 384 EC numbers. These experiments verify the effectiveness of our proposed method.
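The two structural perspectives mentioned above correspond to quantities that can be computed directly from atomic coordinates. The sketch below is illustrative only, not the authors' implementation: assuming Cα (and, for dihedrals, four consecutive backbone atom) coordinates as NumPy arrays, it derives the pairwise residue distance matrix and a torsion angle of the kind such self-supervised tasks could use as prediction targets.

```python
import numpy as np

def pairwise_residue_distances(ca_coords):
    """Pairwise Euclidean distances between residues.

    ca_coords: (N, 3) array of Calpha coordinates.
    Returns an (N, N) symmetric distance matrix.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def dihedral_angle(p0, p1, p2, p3):
    """Signed dihedral (torsion) angle, in radians, defined by four points.

    Uses the standard atan2 formulation on the two plane normals.
    """
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1 = np.cross(b1, b2)          # normal of the first plane
    n2 = np.cross(b2, b3)          # normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    x = np.dot(n1, n2)
    y = np.dot(m1, n2)
    return np.arctan2(y, x)

# Toy example: two residues 5 angstroms apart, and a 90-degree torsion.
ca = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
D = pairwise_residue_distances(ca)           # D[0, 1] == 5.0
angle = dihedral_angle(np.array([0.0, 1.0, 0.0]),
                       np.array([0.0, 0.0, 0.0]),
                       np.array([1.0, 0.0, 0.0]),
                       np.array([1.0, 0.0, 1.0]))  # |angle| == pi/2
```

In a pretraining setup of this kind, such continuous targets are typically discretized into bins so that the graph neural network predicts distance and angle classes from masked or corrupted structure inputs.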
The AlphaFold2 database is available at https://alphafold.ebi.ac.uk/. The PDB files are available at https://www.rcsb.org/. The downstream task datasets are available at https://github.com/phermosilla/IEConv_proteins/tree/master/Datasets. The code of the proposed method is available at https://github.com/GGchen1997/STEPS_Bioinformatics.