Unifying sequence-structure coding for advanced protein engineering: a multimodal diffusion transformer.

Authors

Lin Xiaohan, Chen Zhenyu, Li Yanheng, Ma Zicheng, Fan Chuanliu, Cao Ziqiang, Feng Shihao, Zhang Jun, Gao Yi Qin

Affiliations

Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China

Changping Laboratory, Beijing 102200, China

Publication information

Chem Sci. 2025 May 15. doi: 10.1039/d5sc02055g.

DOI: 10.1039/d5sc02055g

PMID: 40417294

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12096517/

Abstract

Modern protein engineering demands integrated sequence-structure representations to tackle key challenges in designing, modifying, and evolving proteins for specific functions. While sequence-based methods are promising for generating novel proteins, incorporating structure-oriented information improves the success rate and helps target corresponding functions. Therefore, rather than relying solely on sequence-based or structure-based approaches, a consensus strategy is essential. Here, we introduce ProTokens, machine-learned "amino acids" derived from structural databases via self-supervised learning, which provide a compact yet information-rich representation that bridges the sequence and structure modalities. Instead of treating sequences and structures separately, we build PT-DiT, a multimodal diffusion transformer-based model that integrates both into a unified representation. This enables protein engineering in a joint sequence-structure space, streamlining the design process and facilitating the efficient encoding of 3D folds, contextual protein design, sampling of metastable states, and directed evolution for diverse objectives. As a unified solution for protein engineering, PT-DiT leverages sequence and structure insights to realize functional protein design.
