Yin Mingze, Zhou Hanjing, Zhu Yiheng, Lin Miao, Wu Yixuan, Wu Jialu, Xu Hongxia, Hsieh Chang-Yu, Hou Tingjun, Chen Jintai, Wu Jian
School of Medicine, Zhejiang University, Hangzhou, China.
College of Computer Science and Technology, Zhejiang University, Hangzhou, China.
Health Data Sci. 2024 Dec 19;4:0211. doi: 10.34133/hds.0211. eCollection 2024.
Proteins govern most biological functions essential for life, and controllable protein editing has enabled great advances in probing natural systems, creating therapeutic conjugates, and generating novel protein constructs. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot explicitly edit proteins from biotext instructions, which limits their interactivity with human feedback. To fill these gaps, we propose ProtET, a novel method for efficient CLIP-informed protein editing through multi-modality learning. Our approach comprises 2 stages. In the pretraining stage, contrastive learning aligns protein and biotext representations encoded by 2 large language models (LLMs). In the subsequent protein editing stage, features fused from the editing instruction text and the original protein sequence serve as the final editing condition for generating the target protein sequence. Comprehensive experiments demonstrated the superiority of ProtET in editing proteins to enhance human-expected functionality across multiple attribute domains, including enzyme catalytic activity, protein stability, and antibody-specific binding ability. ProtET improves on state-of-the-art results by a large margin, with stability improvements of 16.67% and 16.90%. This capability positions ProtET to advance real-world artificial protein editing, potentially addressing unmet academic, industrial, and clinical needs.
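The pretraining stage described above aligns protein and biotext embeddings by contrastive learning. The abstract does not give the loss, so the following is only a minimal sketch of a generic CLIP-style symmetric InfoNCE objective, not the ProtET implementation. The function names, the plain-list embedding format, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    protein_emb[i] and text_emb[i] are assumed to describe the same protein.
    The temperature value is an illustrative assumption, not from the paper.
    """
    p = [l2_normalize(v) for v in protein_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(p)
    # Cosine similarity between every protein and every text, scaled.
    logits = [
        [sum(a * b for a, b in zip(p[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Protein-to-text direction: each row should pick out its own text.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-protein direction: each column should pick out its own protein.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2t + loss_t2p)


if __name__ == "__main__":
    proteins = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", clip_style_loss(proteins, matched_texts))
    print("swapped pairs:", clip_style_loss(proteins, swapped_texts))
```

Under these assumptions, correctly paired embeddings yield a loss near zero and mismatched pairs a large one, which is the signal that pulls the two encoders toward a shared representation space.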