Institute for Interdisciplinary Information Sciences, Tsinghua University, China.
Pac Symp Biocomput. 2023;28:109-120.
Although protein sequence data is growing at an ever-increasing rate, the protein universe remains sparsely annotated with functional and structural labels. Computational approaches have become an efficient way to infer annotations for unlabeled proteins by transferring knowledge from proteins with experimental annotations. Despite the increasing availability of protein structure data and the high coverage of high-quality predicted structures, e.g., from AlphaFold, many existing computational tools still rely only on sequence data to predict structural or functional annotations, including alignment algorithms such as BLAST and several sequence-based deep learning models. Here, we develop PenLight, a general deep learning framework for protein structural and functional annotation. PenLight uses a graph neural network (GNN) to integrate 3D protein structure data and protein language model representations. In addition, PenLight applies a contrastive learning strategy to train the GNN to learn protein representations that reflect similarities beyond sequence identity, such as semantic similarities in the function or structure space. We benchmarked PenLight on a structural classification task and a functional annotation task, where it achieved higher prediction accuracy and coverage than state-of-the-art methods.
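To make the contrastive idea in the abstract concrete, the following is a minimal sketch, not PenLight's actual implementation. It assumes a triplet margin loss as the contrastive objective and a nearest-neighbor lookup for transferring annotations. The abstract specifies neither, so the function names, the margin value, the toy embeddings, and the labels `fold_A` and `fold_B` are all hypothetical. In the real framework the embeddings would come from the GNN; here they are hand-written 2-D vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style contrastive loss (assumed form, not taken from the paper).

    The loss is zero once the negative lies at least `margin` farther
    from the anchor than the positive does.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def transfer_annotation(query, labeled):
    """Return (label, distance) of the labeled embedding closest to `query`.

    `labeled` is a list of (embedding, label) pairs.
    """
    best_embedding, best_label = min(labeled, key=lambda pair: euclidean(query, pair[0]))
    return best_label, euclidean(query, best_embedding)


# Toy 2-D embeddings with made-up values, for illustration only.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # hypothetically shares the anchor's class
negative = [3.0, 4.0]  # hypothetically belongs to a different class

# Well-separated triplet: the negative is already far enough away.
loss_satisfied = triplet_margin_loss(anchor, positive, negative)

# Swapping the roles violates the margin and yields a positive loss.
loss_violated = triplet_margin_loss(anchor, negative, positive)

# Annotate an unlabeled query by its nearest labeled neighbor.
label, dist = transfer_annotation(
    [0.0, 0.5], [(positive, "fold_A"), (negative, "fold_B")]
)
```

Under these assumptions, training pushes same-class proteins together and different-class proteins apart in embedding space, so that a simple distance lookup can transfer an annotation even when sequence identity is low.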