Ma Zicheng, Fan Chuanliu, Wang Zhicong, Chen Zhenyu, Lin Xiaohan, Li Yanheng, Feng Shihao, Cao Ziqiang, Zhang Jun, Gao Yi Qin
Changping Laboratory, Beijing 102200, China.
Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China.
J Chem Inf Model. 2025 Jul 14;65(13):6599-6612. doi: 10.1021/acs.jcim.5c00585. Epub 2025 Jun 25.
Large language models (LLMs) have made remarkable progress in molecular science, particularly in understanding and generating functional small molecules. This success is largely attributable to the effectiveness of molecular tokenization strategies. In protein science, however, the amino acid sequence serves as the sole token representation for LLMs, while many fundamental challenges in the field are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capability of LLMs for comprehensive biomolecular understanding and multimodal generation. To address these challenges, we introduce ProtTeX, a framework that tokenizes protein sequences, structures, and textual information into a unified discrete space. This approach enables joint training of the LLM exclusively through the next-token prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX allows general LLMs to perceive and process protein structures through sequential text input, to leverage structural information as intermediate reasoning components, and to generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain-expert model with a twofold increase in accuracy. The framework also enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that, by adopting the standard training and inference pipelines of the LLM domain, ProtTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.
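To make the central mechanism concrete, below is a minimal sketch of what training over a unified discrete token space for sequence and structure with plain next-token prediction can look like. It is an illustration under stated assumptions, not the authors' implementation: the vocabularies (AA_VOCAB, STRUCT_VOCAB, SPECIAL), the encoding layout, and the toy decoder (ToyCausalLM) are hypothetical stand-ins for the paper's structure tokenizer and base LLM.

```python
# Sketch: unified sequence/structure tokenization + next-token prediction.
# All names and layouts here are illustrative assumptions, not ProtTeX's
# actual tokenizer, vocabulary, or model.
import torch
import torch.nn as nn

# Hypothetical vocabularies: 20 amino-acid tokens plus a small discrete
# structure alphabet (e.g., codes from a learned codebook) and control tokens.
AA_VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
STRUCT_VOCAB = {f"<s{i}>": 20 + i for i in range(32)}  # structure codes
SPECIAL = {"<bos>": 52, "<sep>": 53, "<eos>": 54}      # control tokens
VOCAB_SIZE = 55

def encode_protein(sequence, structure_codes):
    """Concatenate sequence tokens, a separator, and structure tokens
    into one discrete stream over the shared vocabulary."""
    ids = [SPECIAL["<bos>"]]
    ids += [AA_VOCAB[aa] for aa in sequence]
    ids += [SPECIAL["<sep>"]]
    ids += [STRUCT_VOCAB[c] for c in structure_codes]
    ids += [SPECIAL["<eos>"]]
    return torch.tensor(ids)

class ToyCausalLM(nn.Module):
    """A tiny decoder-only stand-in for the base LLM."""
    def __init__(self, vocab=VOCAB_SIZE, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        x = self.embed(ids)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(x, mask=mask))

# Next-token prediction: shift inputs against targets and apply a single
# cross-entropy loss over the unified vocabulary, regardless of modality.
ids = encode_protein("ACDE", ["<s0>", "<s3>", "<s1>", "<s7>"]).unsqueeze(0)
model = ToyCausalLM()
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), ids[:, 1:].reshape(-1)
)
loss.backward()
print(f"NTP loss: {loss.item():.3f}")
```

Because structure codes share one vocabulary with amino-acid (and, in the full framework, text) tokens, no modality-specific heads or encoders are needed: generating a structure reduces to continued text decoding, which is the property the abstract highlights for multimodal reasoning and generation.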