
Identifying Protein-Nucleotide Binding Residues via Grouped Multi-task Learning and Pre-trained Protein Language Models.

Authors

Wu Jiashun, Liu Yan, Zhang Ying, Wang Xiaoyu, Yan He, Zhu Yiheng, Song Jiangning, Yu Dong-Jun

Affiliations

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.

School of Information Engineering, Yangzhou University, Yangzhou 225100, China.

Publication information

J Chem Inf Model. 2025 Jan 27;65(2):1040-1052. doi: 10.1021/acs.jcim.4c02092. Epub 2025 Jan 9.

Abstract

The accurate identification of protein-nucleotide binding residues is crucial for protein function annotation and drug discovery. Numerous computational methods have been proposed to predict these binding residues, achieving remarkable performance. However, due to the limited availability and high variability of nucleotides, predicting binding residues for diverse nucleotides remains a significant challenge. To address this challenge, we propose NucGMTL, a new grouped deep multi-task learning approach designed for predicting binding residues of all observed nucleotides in the BioLiP database. NucGMTL leverages pre-trained protein language models to generate robust sequence embeddings and incorporates multi-scale learning along with scale-based self-attention mechanisms to capture a broader range of feature dependencies. To effectively harness the binding patterns shared across various nucleotides, deep multi-task learning is used to distill common representations, taking advantage of auxiliary information from similar nucleotides selected by task grouping. Performance evaluation on benchmark data sets shows that NucGMTL achieves an average area under the Precision-Recall curve (AUPRC) of 0.594, surpassing other state-of-the-art methods. Further analyses indicate that the main advantage of NucGMTL lies in its effective integration of grouped multi-task learning and pre-trained protein language models. The data set and source code are freely accessible at: https://github.com/jerry1984Y/NucGMTL.
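
For readers unfamiliar with the headline metric, the following is a minimal sketch of how an average AUPRC across several nucleotide tasks could be computed. It is not taken from the NucGMTL code base: the function names `average_precision` and `mean_auprc`, the task labels, and the toy labels and scores are all illustrative assumptions. The sketch uses the step-wise (average precision) estimate of the area under the Precision-Recall curve and assumes there are no tied scores; the paper's own evaluation script may differ in these details.

```python
from typing import Dict, Sequence, Tuple


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a binding residue, 0 otherwise.
    scores: predicted binding probability for each residue.
    Assumes no tied scores (ties would need to be grouped).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank residues from most to least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at the rank where recall increases by 1 / total_pos.
            ap += true_pos / rank
    return ap / total_pos


def mean_auprc(per_task: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of the per-task AUPRC values (one task per nucleotide)."""
    values = [average_precision(y, s) for y, s in per_task.values()]
    return sum(values) / len(values)


# Toy data only; the task names and numbers are hypothetical.
toy_tasks = {
    "ATP": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "GDP": ([1, 1, 0], [0.9, 0.8, 0.1]),
}
print(mean_auprc(toy_tasks))
```

AUPRC is a common choice for binding-residue prediction because binding residues are typically a small minority of all residues, and precision-recall curves are more sensitive to performance on the minority class than ROC curves.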

