College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.
Int J Biol Macromol. 2024 Nov;281(Pt 2):136147. doi: 10.1016/j.ijbiomac.2024.136147. Epub 2024 Oct 1.
Protein-DNA interactions play critical roles in various biological processes and are essential for drug discovery. However, traditional experimental methods are labor-intensive and unable to keep pace with the increasing volume of protein sequences, leading to a substantial number of proteins lacking DNA-binding annotations. Therefore, developing an efficient computational method to identify protein-DNA binding sites is crucial. Unfortunately, most existing computational methods rely on manually selected features or protein structure information, making these methods inapplicable to large-scale prediction tasks. In this study, we introduced PDNAPred, a sequence-based method that combines two pre-trained protein language models with a designed CNN-GRU network to identify DNA-binding sites. Additionally, to tackle the issue of imbalanced dataset samples, we employed focal loss. Our comprehensive experiments demonstrated that PDNAPred significantly improved the accuracy of DNA-binding site prediction, outperforming existing state-of-the-art sequence-based methods. Remarkably, PDNAPred also achieved results comparable to advanced structure-based methods. The designed CNN-GRU network enhances its capability to detect DNA-binding sites accurately. Furthermore, we validated the versatility of PDNAPred by training it on RNA-binding site datasets, showing its potential as a general framework for amino acid binding site prediction. Finally, we conducted model interpretability analysis to elucidate the reasons behind PDNAPred's outstanding performance.
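The abstract names focal loss as its remedy for class imbalance, since DNA-binding residues are typically far fewer than non-binding ones. The following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), and not the authors' implementation. The function name, the per-residue list inputs, and the defaults gamma = 2.0 and alpha = 0.25 are assumptions chosen for illustration; the abstract does not state which values PDNAPred uses.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over per-residue predictions.

    probs  -- predicted probability that each residue is a binding site
    labels -- 1 for binding residues, 0 for non-binding residues
    gamma  -- focusing parameter (assumed value, not taken from the paper)
    alpha  -- weight of the positive class (assumed value, not from the paper)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp to avoid log(0) on saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability assigned to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the loss of well-classified residues.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # One confidently correct negative and one poorly predicted positive.
    print(binary_focal_loss([0.05, 0.30], [0, 1]))
```

With gamma = 0 the modulating factor disappears and the expression reduces to class-weighted cross-entropy. Larger gamma values shift the training signal toward hard, misclassified residues, which is why the loss is a common choice when one class dominates the dataset.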