Kelly Thomas, Xia Song, Lu Jieyu, Zhang Yingkai
Department of Chemistry, New York University, New York, New York 10003, United States.
Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States.
J Chem Inf Model. 2025 Apr 28;65(8):3990-3998. doi: 10.1021/acs.jcim.5c00051. Epub 2025 Apr 8.
Deep learning has revolutionized difficult tasks in chemistry and biology, yet existing language models often treat these domains separately, relying on concatenated architectures and independently pretrained weights. These approaches fail to fully exploit the shared atomic foundations of molecular and protein sequences. Here, we introduce T5ProtChem, a unified model based on the T5 architecture and designed to process molecular and protein sequences simultaneously. Using a new pretraining objective, ProtiSMILES, T5ProtChem bridges the molecular and protein domains, enabling efficient, generalizable protein-chemical modeling. The model achieves state-of-the-art performance in tasks such as binding affinity prediction and reaction prediction, and performs strongly in protein function prediction. Additionally, it supports novel applications, including covalent binder classification and sequence-level adduct prediction. These results demonstrate the versatility of unified language models for drug discovery, protein engineering, and other interdisciplinary efforts in computational biology and chemistry.