Suppr 超能文献



Unified Deep Learning of Molecular and Protein Language Representations with T5ProtChem.

Authors

Thomas Kelly, Song Xia, Jieyu Lu, Yingkai Zhang

Affiliations

Department of Chemistry, New York University, New York, New York 10003, United States.

Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States.

Publication

J Chem Inf Model. 2025 Apr 28;65(8):3990-3998. doi: 10.1021/acs.jcim.5c00051. Epub 2025 Apr 8.

DOI: 10.1021/acs.jcim.5c00051
PMID: 40197028
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12042257/
Abstract

Deep learning has revolutionized difficult tasks in chemistry and biology, yet existing language models often treat these domains separately, relying on concatenated architectures and independently pretrained weights. These approaches fail to fully exploit the shared atomic foundations of molecular and protein sequences. Here, we introduce T5ProtChem, a unified model based on the T5 architecture, designed to simultaneously process molecular and protein sequences. Using a new pretraining objective, ProtiSMILES, T5ProtChem bridges the molecular and protein domains, enabling efficient, generalizable protein-chemical modeling. The model achieves a state-of-the-art performance in tasks such as binding affinity prediction and reaction prediction, while having a strong performance in protein function prediction. Additionally, it supports novel applications, including covalent binder classification and sequence-level adduct prediction. These results demonstrate the versatility of unified language models for drug discovery, protein engineering, and other interdisciplinary efforts in computational biology and chemistry.
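The abstract's central idea is a single T5-style encoder-decoder that reads protein and small-molecule sequences in one token stream rather than through concatenated, separately pretrained models. A minimal sketch of what such a joint input might look like is below; the separator token and character-level tokenization are illustrative assumptions, not the paper's actual ProtiSMILES vocabulary or scheme.

```python
def protismiles_pair(protein_seq: str, smiles: str, sep_token: str = "<sep>") -> str:
    """Join a protein sequence and a ligand SMILES string into a single
    input string for a unified seq2seq model.

    The separator token and per-character tokenization used here are
    hypothetical, chosen only to illustrate the idea of one shared
    token stream over both domains.
    """
    # Space-separate residues and SMILES characters so a word-level
    # tokenizer treats each symbol as its own token.
    protein_tokens = " ".join(protein_seq)
    smiles_tokens = " ".join(smiles)
    return f"{protein_tokens} {sep_token} {smiles_tokens}"


# Example: a short peptide fragment paired with ethanol's SMILES (CCO).
example = protismiles_pair("MKVL", "CCO")
```

In a setup like the one the abstract describes, such a combined stream would be fed to one shared encoder-decoder, letting pretraining exploit the common atomic foundations of both sequence types for tasks such as binding affinity or reaction prediction.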


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/637b/12042257/02327f88915f/ci5c00051_0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/637b/12042257/d56b7c52c5c4/ci5c00051_0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/637b/12042257/fa1c1c159d73/ci5c00051_0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/637b/12042257/bff6626de38a/ci5c00051_0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/637b/12042257/14dd1426df17/ci5c00051_0005.jpg

Similar articles

1. Unified Deep Learning of Molecular and Protein Language Representations with T5ProtChem.
J Chem Inf Model. 2025 Apr 28;65(8):3990-3998. doi: 10.1021/acs.jcim.5c00051. Epub 2025 Apr 8.

2. ProteinBERT: a universal deep-learning model of protein sequence and function.
Bioinformatics. 2022 Apr 12;38(8):2102-2110. doi: 10.1093/bioinformatics/btac020.

3. Drug-Target Binding Affinity Prediction in a Continuous Latent Space Using Variational Autoencoders.
IEEE/ACM Trans Comput Biol Bioinform. 2024 Sep-Oct;21(5):1458-1467. doi: 10.1109/TCBB.2024.3402661. Epub 2024 Oct 9.

4. Identifying RNA-small Molecule Binding Sites Using Geometric Deep Learning with Language Models.
J Mol Biol. 2025 Apr 15;437(8):169010. doi: 10.1016/j.jmb.2025.169010. Epub 2025 Feb 15.

5. TC-DTA: Predicting Drug-Target Binding Affinity With Transformer and Convolutional Neural Networks.
IEEE Trans Nanobioscience. 2024 Oct;23(4):572-578. doi: 10.1109/TNB.2024.3441590. Epub 2024 Oct 15.

6. Predicting Drug-Target Interactions with Deep-Embedding Learning of Graphs and Sequences.
J Phys Chem A. 2021 Jul 1;125(25):5633-5642. doi: 10.1021/acs.jpca.1c02419. Epub 2021 Jun 18.

7. Drug-Target Interaction Prediction: End-to-End Deep Learning Approach.
IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2364-2374. doi: 10.1109/TCBB.2020.2977335. Epub 2021 Dec 8.

8. SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues.
Commun Biol. 2024 Jun 3;7(1):679. doi: 10.1038/s42003-024-06332-0.

9. A Hierarchical Graph Neural Network Framework for Predicting Protein-Protein Interaction Modulators With Functional Group Information and Hypergraph Structure.
IEEE J Biomed Health Inform. 2024 Jul;28(7):4295-4305. doi: 10.1109/JBHI.2024.3384238. Epub 2024 Jul 2.

10. Contrastive learning in protein language space predicts interactions between drugs and protein targets.
Proc Natl Acad Sci U S A. 2023 Jun 13;120(24):e2220778120. doi: 10.1073/pnas.2220778120. Epub 2023 Jun 8.

References cited in this article

1. A review of large language models and autonomous agents in chemistry.
Chem Sci. 2024 Dec 9;16(6):2514-2572. doi: 10.1039/d4sc03921a. eCollection 2025 Feb 5.

2. Simulating 500 million years of evolution with a language model.
Science. 2025 Feb 21;387(6736):850-858. doi: 10.1126/science.ads0018. Epub 2025 Jan 16.

3. Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling.
Sci Rep. 2024 Oct 23;14(1):25016. doi: 10.1038/s41598-024-76440-8.

4. Large Language Models as Molecular Design Engines.
J Chem Inf Model. 2024 Sep 23;64(18):7086-7096. doi: 10.1021/acs.jcim.4c01396. Epub 2024 Sep 4.

5. Biospecific Chemistry for Covalent Linking of Biomacromolecules.
Chem Rev. 2024 Jul 10;124(13):8516-8549. doi: 10.1021/acs.chemrev.4c00066. Epub 2024 Jun 24.

6. A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):9052-9071. doi: 10.1109/TPAMI.2024.3415112. Epub 2024 Nov 6.

7. nach0: multimodal natural and chemical languages foundation model.
Chem Sci. 2024 May 8;15(22):8380-8389. doi: 10.1039/d4sc00966e. eCollection 2024 Jun 5.

8. Augmenting large language models with chemistry tools.
Nat Mach Intell. 2024;6(5):525-535. doi: 10.1038/s42256-024-00832-8. Epub 2024 May 8.

9. Language models can identify enzymatic binding sites in protein sequences.
Comput Struct Biotechnol J. 2024 Apr 30;23:1929-1937. doi: 10.1016/j.csbj.2024.04.012. eCollection 2024 Dec.

10. Accurate structure prediction of biomolecular interactions with AlphaFold 3.
Nature. 2024 Jun;630(8016):493-500. doi: 10.1038/s41586-024-07487-w. Epub 2024 May 8.