PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models.

Affiliation

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

Publication information

Int J Biol Macromol. 2024 Nov;281(Pt 2):136147. doi: 10.1016/j.ijbiomac.2024.136147. Epub 2024 Oct 1.

DOI:10.1016/j.ijbiomac.2024.136147

Abstract

Protein-DNA interactions play critical roles in various biological processes and are essential for drug discovery. However, traditional experimental methods are labor-intensive and unable to keep pace with the increasing volume of protein sequences, leading to a substantial number of proteins lacking DNA-binding annotations. Therefore, developing an efficient computational method to identify protein-DNA binding sites is crucial. Unfortunately, most existing computational methods rely on manually selected features or protein structure information, making these methods inapplicable to large-scale prediction tasks. In this study, we introduced PDNAPred, a sequence-based method that combines two pre-trained protein language models with a designed CNN-GRU network to identify DNA-binding sites. Additionally, to tackle the issue of imbalanced dataset samples, we employed focal loss. Our comprehensive experiments demonstrated that PDNAPred significantly improved the accuracy of DNA-binding site prediction, outperforming existing state-of-the-art sequence-based methods. Remarkably, PDNAPred also achieved results comparable to advanced structure-based methods. The designed CNN-GRU network enhances its capability to detect DNA-binding sites accurately. Furthermore, we validated the versatility of PDNAPred by training it on RNA-binding site datasets, showing its potential as a general framework for amino acid binding site prediction. Finally, we conducted model interpretability analysis to elucidate the reasons behind PDNAPred's outstanding performance.
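
The abstract says focal loss was used to handle the imbalance between binding and non-binding residues, but it does not give the formulation or hyperparameters. As a minimal sketch only, the snippet below implements the standard binary focal loss; the function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions, not values taken from PDNAPred.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over per-residue predictions.

    probs  : predicted probability that each residue is a binding site
    labels : 1 for binding residues, 0 for non-binding residues
    gamma  : focusing parameter (assumed default; gamma = 0 disables focusing)
    alpha  : weight on the positive class (assumed default; negatives get 1 - alpha)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")

    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma shrinks the contribution of well-classified residues.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# Hypothetical per-residue predictions for a short sequence:
# mostly non-binding residues (label 0) and one binding residue (label 1).
example_probs = [0.05, 0.10, 0.80, 0.20, 0.02]
example_labels = [0, 0, 1, 0, 0]
example_loss = binary_focal_loss(example_probs, example_labels)
```

With `gamma = 0` and `alpha = 0.5` the expression reduces to half the ordinary binary cross-entropy. Raising `gamma` shifts the training signal toward residues the model still misclassifies, which is why this loss is a common choice when one class dominates the dataset.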
