College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.
Nat Commun. 2024 Sep 7;15(1):7838. doi: 10.1038/s41467-024-52293-7.
DNA-protein interactions underpin many pivotal biological processes, such as DNA replication, transcription, and gene regulation. However, accurate and efficient computational methods for identifying these interactions are still lacking. In this study, we propose ESM-DBP, a method built by refining the DNA-binding protein sequence repertory and applying domain-adaptive pretraining to a general protein language model. Because general language models have not been sufficiently explored for DNA-binding protein domain-specific knowledge, we screen out 170,264 DNA-binding protein sequences to construct the domain-adaptive language model. Experimental results on four downstream tasks show that ESM-DBP provides a better feature representation of DNA-binding proteins than the original language model, resulting in improved prediction performance that surpasses state-of-the-art methods. Moreover, ESM-DBP still performs well on sequences with only a few homologous sequences. ChIP-seq experiments on two predicted cases further support the validity of the proposed method.
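The abstract does not state the training objective used for domain-adaptive pretraining. ESM-family protein language models are pretrained with masked language modeling, so a domain-adaptive pass would plausibly continue with that objective on the curated DNA-binding protein sequences. The following is a minimal sketch of how one masked training example could be prepared, not the authors' implementation: the 15% mask rate, the `<mask>` token string, and the helper name `mask_sequence` are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # assumed placeholder token, not taken from the paper


def mask_sequence(sequence, mask_rate=0.15, seed=None):
    """Build one masked-language-model example from a protein sequence.

    Returns (tokens, labels):
      tokens: residues with a random subset replaced by MASK_TOKEN
      labels: the original residue at masked positions, None elsewhere
    The 15% default mask rate is an assumption, not a value from the paper.
    """
    if not sequence:
        raise ValueError("sequence must contain at least one residue")
    rng = random.Random(seed)
    residues = list(sequence)
    # Mask at least one position so every example yields a training signal.
    n_mask = max(1, round(len(residues) * mask_rate))
    masked_positions = set(rng.sample(range(len(residues)), n_mask))
    tokens = [MASK_TOKEN if i in masked_positions else aa
              for i, aa in enumerate(residues)]
    labels = [aa if i in masked_positions else None
              for i, aa in enumerate(residues)]
    return tokens, labels


if __name__ == "__main__":
    # Arbitrary toy string of amino-acid letters, not a real DNA-binding protein.
    toy_sequence = "MKTAYIAKQRQISFVKSHFS"
    tokens, labels = mask_sequence(toy_sequence, seed=0)
    print(tokens)
    print(labels)
```

During training, the model would be asked to predict the residues stored in `labels` from the surrounding unmasked context in `tokens`. Repeating this over a domain-specific corpus is what would shift a general model toward DNA-binding protein features.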