Talo Muhammed, Bozdag Serdar
Department of Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA.
BioDiscovery Institute, University of North Texas, Denton, TX 76207, USA.
bioRxiv. 2025 May 17:2025.05.13.653854. doi: 10.1101/2025.05.13.653854.
Understanding protein functions helps identify the underlying causes of many diseases and guides the search for new therapeutic targets and medications. With the advancement of high-throughput technologies, obtaining novel protein sequences has become routine. However, determining protein functions experimentally is prohibitively expensive and labor-intensive. Therefore, it is crucial to develop computational methods for automatic protein function prediction. In this study, we propose a multi-modal deep learning architecture called ProtFun to predict protein functions. ProtFun uses protein large language model (LLM) embeddings as node features in a protein family network. By applying graph attention networks (GAT) to this network, ProtFun learns protein embeddings, which are integrated with protein signature representations from InterPro to train a protein function prediction model. We evaluated our architecture on three benchmark datasets. Our approach outperformed current state-of-the-art methods in most cases. An ablation study also highlighted the importance of the different components of ProtFun. The data and source code of ProtFun are available at https://github.com/bozdaglab/ProtFun under the Creative Commons Attribution-NonCommercial 4.0 International Public License.
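The pipeline described in the abstract (LLM embeddings as node features, graph attention over a protein family network, then fusion with InterPro signatures) can be illustrated with a small dependency-free sketch. This is not the authors' implementation, which is in the repository linked above. It makes several assumptions the abstract does not state: a single attention head, fusion by simple concatenation, a sigmoid output for multi-label prediction, and random untrained weights. All names (`gat_layer`, `predict`, `signatures`, the toy sizes) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(H, neighbors, W, a):
    """Single-head graph attention: each node attends over itself and its neighbors."""
    Z = [matvec(W, h) for h in H]  # project every node's features
    out = []
    for i, z_i in enumerate(Z):
        nbrs = [i] + [j for j in neighbors[i] if j != i]  # add a self-loop
        # Unnormalized attention score for each (i, j) pair: LeakyReLU(a . [z_i || z_j])
        scores = [leaky_relu(sum(p * q for p, q in zip(a, z_i + Z[j]))) for j in nbrs]
        # Numerically stable softmax over the neighborhood
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of projected neighbor features
        out.append([sum(al * Z[j][k] for al, j in zip(alphas, nbrs))
                    for k in range(len(z_i))])
    return out

def predict(H, neighbors, signatures, W, a, W_out):
    """Concatenate graph embeddings with signature vectors; one sigmoid per function label."""
    probs = []
    for e, s in zip(gat_layer(H, neighbors, W, a), signatures):
        logits = matvec(W_out, e + s)  # list concatenation = feature fusion (assumed)
        probs.append([1.0 / (1.0 + math.exp(-x)) for x in logits])
    return probs

# Toy data: 4 proteins, 3-dim stand-ins for LLM embeddings, 2 signature bits, 3 labels.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

H = rand_matrix(4, 3)                          # node features (hypothetical embeddings)
neighbors = {0: [1, 2], 1: [0], 2: [0], 3: []}  # assumed edges: proteins sharing a family
signatures = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # InterPro-style indicators
W = rand_matrix(2, 3)       # projection to a 2-dim hidden space
a = rand_matrix(1, 4)[0]    # attention vector over [z_i || z_j]
W_out = rand_matrix(3, 4)   # classifier over [graph embedding || signature]

probs = predict(H, neighbors, signatures, W, a, W_out)
for i, p in enumerate(probs):
    print(i, [round(x, 3) for x in p])
```

In this sketch a protein with no family neighbors (node 3) attends only to itself, so its graph embedding reduces to its projected input features. In practice one would train the weights against annotated function labels and likely use a graph learning library rather than hand-written loops.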