School of Computer Science, McGill University, 845 Rue Sherbrooke O, Montreal, Quebec H3A 0G4, Canada.
MILA-Quebec AI Institute, 6666 Rue Saint-Urbain, Montreal, Quebec H2S 3H1, Canada.
Bioinformatics. 2023 Apr 3;39(4). doi: 10.1093/bioinformatics/btad189.
Protein representation learning methods have shown great potential for many downstream tasks in biological applications. A few recent studies have demonstrated that self-supervised learning is a promising solution to the scarcity of labeled proteins, which is a major obstacle to effective protein representation learning. However, existing protein representation learning models are usually pretrained on protein sequences alone, without considering the important structural information of proteins.
In this work, we propose a novel structure-aware protein self-supervised learning method to effectively capture the structural information of proteins. In particular, a graph neural network model is pretrained to preserve protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage an available protein language model, pretrained on protein sequences, to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed graph neural network model via a novel pseudo bi-level optimization scheme. We conduct experiments on three downstream tasks: binary classification into membrane/non-membrane proteins, location classification into 10 cellular compartments, and enzyme-catalyzed reaction classification into 384 EC numbers. These experiments verify the effectiveness of our proposed method.
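The two structural perspectives mentioned above correspond to quantities that can be computed directly from atomic coordinates. The sketch below is illustrative only, not the authors' implementation: assuming Cα (and, for dihedrals, four consecutive backbone atom) coordinates as NumPy arrays, it derives the pairwise residue distance matrix and a torsion angle of the kind such self-supervised tasks could use as prediction targets.

```python
import numpy as np

def pairwise_residue_distances(ca_coords):
    """Pairwise Euclidean distances between residues.

    ca_coords: (N, 3) array of Calpha coordinates.
    Returns an (N, N) symmetric distance matrix.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def dihedral_angle(p0, p1, p2, p3):
    """Signed dihedral (torsion) angle, in radians, defined by four points.

    Uses the standard atan2 formulation on the two plane normals.
    """
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1 = np.cross(b1, b2)          # normal of the first plane
    n2 = np.cross(b2, b3)          # normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    x = np.dot(n1, n2)
    y = np.dot(m1, n2)
    return np.arctan2(y, x)

# Toy example: two residues 5 angstroms apart, and a 90-degree torsion.
ca = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
D = pairwise_residue_distances(ca)           # D[0, 1] == 5.0
angle = dihedral_angle(np.array([0.0, 1.0, 0.0]),
                       np.array([0.0, 0.0, 0.0]),
                       np.array([1.0, 0.0, 0.0]),
                       np.array([1.0, 0.0, 1.0]))  # |angle| == pi/2
```

In a pretraining setup of this kind, such continuous targets are typically discretized into bins so that the graph neural network predicts distance and angle classes from masked or corrupted structure inputs.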
The AlphaFold2 database is available at https://alphafold.ebi.ac.uk/. The PDB files are available at https://www.rcsb.org/. The downstream task datasets are available at https://github.com/phermosilla/IEConv_proteins/tree/master/Datasets. The code of the proposed method is available at https://github.com/GGchen1997/STEPS_Bioinformatics.