Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein.

Affiliation

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Publication information

Nat Commun. 2024 Sep 7;15(1):7838. doi: 10.1038/s41467-024-52293-7.

DOI:10.1038/s41467-024-52293-7
PMID:39244557
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11380688/
Abstract

DNA-protein interactions underpin many pivotal biological processes, such as DNA replication, transcription, and gene regulation. However, accurate and efficient computational methods for identifying these interactions are still lacking. In this study, we propose ESM-DBP, a method that refines the DNA-binding protein sequence repertory and applies domain-adaptive pretraining to a general protein language model. Because general language models have not fully explored the domain-specific knowledge of DNA-binding proteins, we screened out 170,264 DNA-binding protein sequences to construct the domain-adaptive language model. Experimental results on four downstream tasks show that ESM-DBP provides a better feature representation of DNA-binding proteins than the original language model, resulting in improved prediction performance and outperforming the state-of-the-art methods. Moreover, ESM-DBP still performs well on sequences with only a few homologous sequences. ChIP-seq on two predicted cases further supports the validity of the proposed method.
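The abstract does not spell out the pretraining objective. As a minimal sketch, assuming domain-adaptive pretraining continues a masked-language-model objective on the curated DNA-binding protein sequences, the data-corruption step might look like the following. The function name `mask_sequence`, the 15% selection rate, and the 80/10/10 replacement split are illustrative assumptions borrowed from common BERT-style practice, not details taken from the paper.

```python
import random

# The 20 standard amino acids; the mask token string is a placeholder (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Build one masked-LM training example from a protein sequence.

    Returns (tokens, targets). targets[i] is the original residue at each
    selected position and None elsewhere, so a loss would only be computed
    on the selected positions.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    targets = [None] * len(tokens)
    if not tokens:
        return tokens, targets

    # Select at least one position, and never more than the sequence length.
    n_select = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    for i in rng.sample(range(len(tokens)), n_select):
        targets[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = MASK                      # 80%: replace with the mask token
        elif roll < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)   # 10%: replace with a random residue
        # remaining 10%: leave the residue unchanged
    return tokens, targets


if __name__ == "__main__":
    # Hypothetical sequence, used only to show the output shape.
    toks, tgts = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", rng=random.Random(0))
    print("".join(t if t != MASK else "_" for t in toks))
    print([(i, t) for i, t in enumerate(tgts) if t is not None])
```

Under this assumption, the "domain-adaptive" part lies in the data and the starting point rather than in the objective: training would start from the general model's weights and see only the screened DNA-binding protein sequences.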
