PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models.

Affiliations

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

Publication information

Int J Biol Macromol. 2024 Nov;281(Pt 2):136147. doi: 10.1016/j.ijbiomac.2024.136147. Epub 2024 Oct 1.

DOI:10.1016/j.ijbiomac.2024.136147
PMID:39357703
Abstract

Protein-DNA interactions play critical roles in various biological processes and are essential for drug discovery. However, traditional experimental methods are labor-intensive and unable to keep pace with the increasing volume of protein sequences, leading to a substantial number of proteins lacking DNA-binding annotations. Therefore, developing an efficient computational method to identify protein-DNA binding sites is crucial. Unfortunately, most existing computational methods rely on manually selected features or protein structure information, making these methods inapplicable to large-scale prediction tasks. In this study, we introduced PDNAPred, a sequence-based method that combines two pre-trained protein language models with a designed CNN-GRU network to identify DNA-binding sites. Additionally, to tackle the issue of imbalanced dataset samples, we employed focal loss. Our comprehensive experiments demonstrated that PDNAPred significantly improved the accuracy of DNA-binding site prediction, outperforming existing state-of-the-art sequence-based methods. Remarkably, PDNAPred also achieved results comparable to advanced structure-based methods. The designed CNN-GRU network enhances its capability to detect DNA-binding sites accurately. Furthermore, we validated the versatility of PDNAPred by training it on RNA-binding site datasets, showing its potential as a general framework for amino acid binding site prediction. Finally, we conducted model interpretability analysis to elucidate the reasons behind PDNAPred's outstanding performance.
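
The abstract names focal loss as the remedy for class imbalance (binding residues are far rarer than non-binding ones). The following is a minimal sketch of the standard binary focal loss in plain Python, not the authors' implementation. The abstract does not state which variant or hyperparameters were used, so the function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration.

```python
import math

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over per-residue predictions.

    probs  : predicted probabilities that each residue is a binding site
    labels : 1 for binding residues, 0 for non-binding residues
    gamma  : focusing parameter (assumed value); larger values down-weight
             well-classified residues more strongly
    alpha  : weight for the positive class (assumed value); negatives
             receive 1 - alpha
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p             # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

# A confidently correct residue contributes much less than a misclassified one.
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
print(easy < hard)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which makes the extra factor `(1 - p_t) ** gamma` easy to isolate: it is the term that shifts training effort toward hard, typically minority-class, residues.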

Similar articles

1. PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models.
Int J Biol Macromol. 2024 Nov;281(Pt 2):136147. doi: 10.1016/j.ijbiomac.2024.136147. Epub 2024 Oct 1.
2. Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling.
J Mol Biol. 2024 Nov 15;436(22):168769. doi: 10.1016/j.jmb.2024.168769. Epub 2024 Aug 29.
3. DeepDBS: Identification of DNA-binding sites in protein sequences by using deep representations and random forest.
Methods. 2024 Nov;231:26-36. doi: 10.1016/j.ymeth.2024.09.004. Epub 2024 Sep 11.
4. Survey of Computational Approaches for Prediction of DNA-Binding Residues on Protein Surfaces.
Methods Mol Biol. 2018;1754:223-234. doi: 10.1007/978-1-4939-7717-8_13.
5. EGPDI: identifying protein-DNA binding sites based on multi-view graph embedding fusion.
Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae330.
6. PreDBP-PLMs: Prediction of DNA-binding proteins based on pre-trained protein language models and convolutional neural networks.
Anal Biochem. 2024 Nov;694:115603. doi: 10.1016/j.ab.2024.115603. Epub 2024 Jul 8.
7. Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein.
Nat Commun. 2024 Sep 7;15(1):7838. doi: 10.1038/s41467-024-52293-7.
8. An overview of the prediction of protein DNA-binding sites.
Int J Mol Sci. 2015 Mar 6;16(3):5194-215. doi: 10.3390/ijms16035194.
9. Predicting DNA-binding sites of proteins based on sequential and 3D structural information.
Mol Genet Genomics. 2014 Jun;289(3):489-99. doi: 10.1007/s00438-014-0812-x. Epub 2014 Jan 22.
10. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction.
Brief Bioinform. 2024 Jan 22;25(2). doi: 10.1093/bib/bbae040.

Cited by

1. Predicting nucleic acid binding sites by attention map-guided graph convolutional network with protein language embeddings and physicochemical information.
Brief Bioinform. 2025 Aug 31;26(5). doi: 10.1093/bib/bbaf457.
2. Language Modelling Techniques for Analysing the Impact of Human Genetic Variation.
Bioinform Biol Insights. 2025 Sep 2;19:11779322251358314. doi: 10.1177/11779322251358314. eCollection 2025.
3. Advancing the accuracy of clathrin protein prediction through multi-source protein language models.
Sci Rep. 2025 Jul 8;15(1):24403. doi: 10.1038/s41598-025-08510-4.
4. A Survey of Pretrained Protein Language Models.
Methods Mol Biol. 2025;2941:1-29. doi: 10.1007/978-1-0716-4623-6_1.
5. PLM-ATG: Identification of Autophagy Proteins by Integrating Protein Language Model Embeddings with PSSM-Based Features.
Molecules. 2025 Apr 10;30(8):1704. doi: 10.3390/molecules30081704.
6. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf016.