

ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models.

Authors

Ma Zicheng, Fan Chuanliu, Wang Zhicong, Chen Zhenyu, Lin Xiaohan, Li Yanheng, Feng Shihao, Cao Ziqiang, Zhang Jun, Gao Yi Qin

Affiliations

Changping Laboratory, Beijing 102200, China.

Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China.

Publication

J Chem Inf Model. 2025 Jul 14;65(13):6599-6612. doi: 10.1021/acs.jcim.5c00585. Epub 2025 Jun 25.

DOI: 10.1021/acs.jcim.5c00585
PMID: 40560205
Abstract

Large language models (LLMs) have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProtTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a 2-fold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProtTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.
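The abstract's core idea — serializing sequence tokens and discretized structure tokens into one shared vocabulary so a decoder-only LLM can be trained with plain next-token prediction — can be sketched as follows. This is an illustrative toy, not the paper's actual tokenizer: the token names, special tokens, and structure-codebook size (`N_STRUCT_CODES`) are all assumptions for demonstration.

```python
# Toy sketch of ProtTeX-style unified tokenization: amino-acid tokens and
# discretized structure tokens share one vocabulary, so sequence and
# structure appear in a single token stream for next-token prediction.
# Vocabulary layout and token names are hypothetical.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_STRUCT_CODES = 8  # assumed size of a structure codebook (e.g. from a VQ encoder)

# One shared discrete space: special tokens, sequence tokens, structure tokens.
SPECIALS = ["<bos>", "<eos>", "<seq>", "<struct>"]
VOCAB = SPECIALS + [f"aa_{a}" for a in AMINO_ACIDS] \
                 + [f"st_{i}" for i in range(N_STRUCT_CODES)]
TOK2ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize_protein(sequence, struct_codes):
    """Serialize a protein as one stream: sequence block, then structure
    block, so the structure is literally 'in context' for the model."""
    toks = ["<bos>", "<seq>"] + [f"aa_{a}" for a in sequence]
    toks += ["<struct>"] + [f"st_{c}" for c in struct_codes]
    toks.append("<eos>")
    return [TOK2ID[t] for t in toks]

def next_token_pairs(ids):
    """Standard next-token-prediction pairs: target ids[i+1] given prefix
    ending at ids[i] — the only training objective the framework needs."""
    return list(zip(ids[:-1], ids[1:]))

ids = tokenize_protein("ACD", [3, 1, 7])
pairs = next_token_pairs(ids)
```

The point of the sketch is that once structure is discretized into the same vocabulary as the sequence, no architectural change is needed: generation, reasoning over structure, and editing all reduce to emitting tokens from this one space.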


Similar Articles

1
ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models.
J Chem Inf Model. 2025 Jul 14;65(13):6599-6612. doi: 10.1021/acs.jcim.5c00585. Epub 2025 Jun 25.
2
The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models.
J Am Med Inform Assoc. 2024 Sep 1;31(9):2151-2158. doi: 10.1093/jamia/ocae090.
3
Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study.
JMIR Med Inform. 2025 Jun 20;13:e75103. doi: 10.2196/75103.
4
A dataset and benchmark for hospital course summarization with adapted large language models.
J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
5
Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline.
J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916.
6
Short-Term Memory Impairment.
7
Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.
J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.
8
Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review.
J Med Internet Res. 2024 Nov 7;26:e22769. doi: 10.2196/22769.
9
Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study.
J Med Internet Res. 2025 Jul 2;27:e65226. doi: 10.2196/65226.
10
Fine-tuning medical language models for enhanced long-contextual understanding and domain expertise.
Quant Imaging Med Surg. 2025 Jun 6;15(6):5450-5462. doi: 10.21037/qims-2024-2655. Epub 2025 Jun 3.