Zhao Chenguang, Liu Tong, Wang Zheng
Computer and Information Sciences Department, St. Ambrose University, 518 W Locust St, Davenport, IA 52803, USA.
Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124, USA.
NAR Genom Bioinform. 2024 Aug 6;6(3):lqae094. doi: 10.1093/nargab/lqae094. eCollection 2024 Sep.
Previous protein function predictors have made predictions primarily from amino acid sequences rather than tertiary structures, because experimentally determined structures are limited in number and predicted structures have been of unsatisfactory quality. AlphaFold recently achieved promising performance in predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is expanding rapidly. We therefore aimed to develop a deep-learning tool that is trained specifically on AlphaFold models and predicts gene ontology (GO) terms from them. We developed a learning architecture that combines geometric vector perceptron graph neural networks with variant transformer decoder layers for multi-label classification. PANDA-3D predicts GO terms from AlphaFold-predicted structures and from amino acid sequence embeddings produced by a large language model. Our method significantly outperformed a state-of-the-art deep-learning method trained on experimentally determined tertiary structures, and it either outperformed or was comparable with several other state-of-the-art, language-model-based methods that take amino acid sequences as input. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB contained over 200 million predicted protein structures as of May 1st, 2023, which makes PANDA-3D a useful tool for accurately annotating the functions of a large number of proteins. PANDA-3D is freely available as a web server at http://dna.cs.miami.edu/PANDA-3D/ and as a repository at https://github.com/zwang-bioinformatics/PANDA-3D.
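As an illustration of what multi-label classification over GO terms means in practice, the following minimal sketch scores each term independently and reports every term whose probability clears a cutoff. It is not PANDA-3D's code: the function names, the 0.5 threshold, and the example logits are assumptions made for this example only, and the logits stand in for whatever per-term scores a trained model would emit.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_terms(logits: dict[str, float], threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (GO term, probability) pairs at or above the threshold.

    Each term is scored independently, so a protein may receive zero,
    one, or many terms. This is what distinguishes multi-label
    classification from single-label (softmax) classification.
    """
    scored = [(term, sigmoid(z)) for term, z in logits.items()]
    kept = [(term, p) for term, p in scored if p >= threshold]
    # Most confident predictions first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-term logits for one protein (values invented for illustration).
example_logits = {
    "GO:0005524": 2.0,
    "GO:0016301": 0.3,
    "GO:0003677": -1.5,
}

predictions = predict_terms(example_logits)
for term, prob in predictions:
    print(f"{term}\t{prob:.3f}")
```

With these invented logits, two of the three terms pass the cutoff. Raising the threshold trades recall for precision, which is why a cutoff like this is usually tuned on held-out data rather than fixed in advance.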