Fu Yiwei, Gu Zhonghui, Luo Xiao, Guo Qirui, Lai Luhua, Deng Minghua
School of Mathematical Sciences, Peking University, Beijing 100871, China.
Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae093.
In the face of a growing disparity between high-throughput sequence data and low-throughput experimental studies, deep learning stands as a promising alternative for annotating protein function. Many data-driven approaches can predict protein functions quickly and accurately. Nevertheless, the inherently statistical nature of deep learning techniques may limit their generalization when they are applied to novel nonhomologous proteins that diverge significantly from existing ones.
In this work, we propose a generalized approach named Graph Adversarial Learning with Alignment (GALA) for protein function prediction. GALA integrates a graph transformer architecture with an attention pooling module to extract embeddings from both protein sequences and structures, facilitating unified learning of protein representations. Notably, GALA incorporates a domain discriminator conditioned on both the learnable representations and the predicted probabilities; the discriminator is trained adversarially to ensure representation invariance across diverse environments. To exploit abundant label information during optimization, we generate label embeddings in the hidden space and explicitly align them with the protein representations. Benchmarked on datasets derived from the PDB and Swiss-Prot databases, GALA achieves performance comparable to several state-of-the-art methods. Moreover, GALA offers biological interpretability by identifying significant functional residues associated with Gene Ontology terms through class activation mapping.
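As an illustration of two of the components named above, the following minimal sketch shows attention pooling over per-residue embeddings and the scoring of labels against the pooled protein representation. It is not taken from the GALA implementation: the function names, the single attention vector, and the sigmoid-of-dot-product scoring are assumptions chosen for brevity, and the graph transformer, domain discriminator, and training loop are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(residue_embeddings, attn_vector):
    """Pool per-residue embeddings into a single protein embedding.

    Assumption: each residue is scored by a dot product with one
    attention vector (hypothetical; the published model may differ).
    Returns the pooled embedding and the per-residue weights.
    """
    scores = [
        sum(h_d * a_d for h_d, a_d in zip(h, attn_vector))
        for h in residue_embeddings
    ]
    weights = softmax(scores)
    dim = len(residue_embeddings[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, residue_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


def label_scores(protein_embedding, label_embeddings):
    """Score each label against the protein embedding.

    Assumption: alignment is measured by a dot product in the shared
    hidden space, squashed to (0, 1) with a sigmoid.
    """
    scores = []
    for label in label_embeddings:
        dot = sum(p * l for p, l in zip(protein_embedding, label))
        scores.append(1.0 / (1.0 + math.exp(-dot)))
    return scores
```

With three residues of dimension two, `attention_pool(residues, attn_vector)` returns one two-dimensional embedding plus three weights that sum to one. Those weights offer a simple, if coarse, view of which residues dominate the pooled representation.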
GALA, which leverages adversarial learning and label embedding alignment to acquire domain-invariant protein representations, exhibits strong generalizability in function prediction for proteins from previously unseen sequence space. By incorporating structures predicted by AlphaFold2, GALA demonstrates significant potential for function annotation of newly discovered sequences. A detailed implementation of GALA is available at https://github.com/fuyw-aisw/GALA.