Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling.

Affiliations

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan.

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan; Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li 32003, Taiwan.

Publication information

J Mol Biol. 2024 Nov 15;436(22):168769. doi: 10.1016/j.jmb.2024.168769. Epub 2024 Aug 29.

Abstract

Deciphering the mechanisms governing protein-DNA interactions is crucial for understanding key cellular processes and disease pathways. In this work, we present a deep learning approach that advances the computational prediction of DNA-interacting residues from protein sequences. Our method leverages the rich contextual representations learned by pre-trained protein language models, such as ProtTrans, to capture intrinsic biochemical properties and sequence motifs indicative of DNA binding sites. We then integrate these contextual embeddings with a multi-window convolutional neural network architecture, which scans across the sequence at varying window sizes to identify both local and global binding patterns. In a comprehensive evaluation on curated benchmark datasets, our approach achieves an area under the ROC curve (AUC) of 0.89, a substantial improvement over previous state-of-the-art sequence-based predictors. This showcases the potential of pairing advanced representation learning with deep neural network designs for uncovering the complex syntax governing protein-DNA interactions directly from primary sequences. Our work not only provides a robust computational tool for characterizing DNA-binding mechanisms, but also highlights the opportunities at the intersection of language modeling, deep learning, and protein sequence analysis. The publicly available code and data further facilitate broader adoption and continued development of these techniques for accelerating mechanistic insights into vital biological processes and disease pathways. The code and data for this work are available at https://github.com/B1607/DIRP.
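The multi-window idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' DIRP implementation: the window sizes (3, 5, 7), the number of filters, the random weights standing in for trained parameters, the 8-dimensional toy vectors standing in for ProtTrans embeddings, and all function names are assumptions chosen purely for illustration.

```python
import math
import random


def make_filters(window, dim, n_filters, rng):
    """Random filters (n_filters x window x dim) standing in for trained weights."""
    scale = 1.0 / math.sqrt(window * dim)
    return [[[rng.uniform(-scale, scale) for _ in range(dim)]
             for _ in range(window)]
            for _ in range(n_filters)]


def conv_same(embeddings, filters):
    """1D convolution along the residue axis with zero padding and ReLU.

    embeddings: L x dim nested list. Returns an L x n_filters nested list,
    so every residue keeps its own feature vector.
    """
    length = len(embeddings)
    window = len(filters[0])
    half = window // 2
    out = []
    for i in range(length):
        row = []
        for filt in filters:
            total = 0.0
            for k in range(window):
                j = i + k - half
                if 0 <= j < length:  # positions outside the sequence count as zeros
                    total += sum(w * x for w, x in zip(filt[k], embeddings[j]))
            row.append(max(0.0, total))  # ReLU
        out.append(row)
    return out


def multi_window_features(embeddings, windows=(3, 5, 7), n_filters=4, seed=0):
    """Run one convolution branch per window size and concatenate per residue."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    branches = [conv_same(embeddings, make_filters(w, dim, n_filters, rng))
                for w in windows]
    return [[value for branch in branches for value in branch[i]]
            for i in range(len(embeddings))]


def residue_scores(features, seed=1):
    """Map each residue's features to a score in [0, 1] with a random linear layer."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in features[0]]
    return [1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(weights, row))))
            for row in features]


# Toy input: 12 residues, each with an 8-dimensional placeholder embedding.
toy_rng = random.Random(42)
toy_embeddings = [[toy_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(12)]
features = multi_window_features(toy_embeddings)
scores = residue_scores(features)
print(len(features), len(features[0]), len(scores))  # 12 12 12
```

Each branch sees a different amount of sequence context around a residue, and concatenating the branches lets a per-residue classifier draw on short and longer-range patterns at once. In a real pipeline the weights would be learned from labeled binding residues and the inputs would come from a pre-trained protein language model.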
