PLM-DBPs: enhancing plant DNA-binding protein prediction by integrating sequence-based and structure-aware protein language models.

Authors

Pokharel Suresh, Barasa Kepha, Pratyush Pawel, Kc Dukka B

Affiliations

Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States.

College of Computing, Michigan Technological University, Houghton 49931, MI, United States.

Publication information

Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf245.

DOI:10.1093/bib/bbaf245

PMID:40439671

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12121366/

Abstract

DNA-binding proteins (DBPs) play a crucial role in gene regulation, development, and environmental responses across plants, animals, and microorganisms. Existing DBP prediction methods are largely limited to sequence information, whether through handcrafted features or sequence-based protein language models (PLMs), and so overlook structural cues critical to protein function. In addition, most existing tools are trained for general DBP prediction and are often inaccurate for plant-specific DBPs because of the unique structural and functional properties of plant proteins. Our work introduces PLM-DBPs, a deep learning framework that integrates both sequence-based and structure-aware representations to enhance DBP prediction in plants. We evaluated several state-of-the-art PLMs to extract high-dimensional protein representations and experimented with various fusion strategies to test whether the different representations carry complementary information. Our final model, a fusion of sequence-based and structure-aware ANN models, achieves a notable improvement in predicting DBPs in plants, outperforming previous state-of-the-art models. Although sequence-based PLMs already demonstrate strong performance in DBP prediction, our findings show that integrating structural information further enhances predictive accuracy. This underscores the complementary nature of structural representations and establishes PLM-DBPs as a robust tool for advancing plant research and agricultural innovation. The proposed model and other resources are publicly available at https://github.com/suresh-pokharel/PLM-DBPs.
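The abstract does not say which fusion rule the final model uses, so the following is only an illustrative sketch of one common option, decision-level (late) fusion, in which the probabilities from a sequence-based model and a structure-aware model are combined by a weighted average. The function names `late_fusion` and `predict_labels`, the `weight` and `threshold` parameters, and the example scores are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def late_fusion(
    seq_probs: Sequence[float],
    struct_probs: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Weighted average of two models' DBP probabilities.

    Hypothetical fusion rule for illustration only; `weight` is the share
    given to the sequence-based model.
    """
    if len(seq_probs) != len(struct_probs):
        raise ValueError("both models must score the same proteins")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        weight * s + (1.0 - weight) * t
        for s, t in zip(seq_probs, struct_probs)
    ]


def predict_labels(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a protein as DBP (1) when its fused probability meets the threshold."""
    return [int(p >= threshold) for p in probs]


# Made-up scores for three proteins from the two (hypothetical) models.
seq_probs = [0.9, 0.4, 0.2]
struct_probs = [0.7, 0.8, 0.1]

fused = late_fusion(seq_probs, struct_probs)
labels = predict_labels(fused)
print(labels)
```

In this toy input the second protein is scored below 0.5 by the sequence-based model alone but is labeled a DBP after fusion, which is the kind of complementary signal the abstract describes.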
