Jiang Kaiyi, Yan Zhaoqing, Di Bernardo Matteo, Sgrizzi Samantha R, Villiger Lukas, Kayabolen Alisan, Kim Byungji, Carscadden Josephine K, Hiraizumi Masahiro, Nishimasu Hiroshi, Gootenberg Jonathan S, Abudayyeh Omar O
Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA.
Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA 02139, USA.
bioRxiv. 2024 Jul 18:2024.07.17.604015. doi: 10.1101/2024.07.17.604015.
Directed evolution of proteins is critical for applications in basic biological research, therapeutics, diagnostics, and sustainability. However, directed evolution methods are labor intensive, cannot efficiently optimize over multiple protein properties, and are often trapped by local maxima. In silico-directed evolution methods incorporating protein language models (PLMs) have the potential to accelerate this engineering process, but current approaches fail to generalize across diverse protein families. We introduce EVOLVEpro, a few-shot active learning framework to rapidly improve protein activity using a combination of PLMs and protein activity predictors, achieving improved activity with as few as four rounds of evolution. EVOLVEpro substantially enhances the efficiency and effectiveness of protein evolution, surpassing current state-of-the-art methods and yielding proteins with up to 100-fold improvement of desired properties. We showcase EVOLVEpro for five proteins across three applications: T7 RNA polymerase for RNA production; a miniature CRISPR nuclease, a prime editor, and an integrase for genome editing; and a monoclonal antibody for epitope binding. These results demonstrate the advantages of few-shot active learning with small amounts of experimental data over zero-shot predictions. EVOLVEpro paves the way for broader applications of AI-guided protein engineering in biology and medicine.
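The few-shot active learning loop described in the abstract can be illustrated with a minimal sketch. It is not the EVOLVEpro implementation: the abstract does not specify the PLM, the activity predictor, or the batch size, so everything here is an assumption. `toy_embedding` is a hypothetical stand-in for a PLM embedding, `toy_assay` is a hypothetical stand-in for a wet-lab measurement, and a k-nearest-neighbour regressor is swapped in as a simple activity predictor.

```python
import math
import random


def toy_embedding(variant, dim=8):
    """Hypothetical stand-in for a PLM embedding: a fixed pseudo-random vector per variant."""
    rng = random.Random(variant)  # seeding with the variant name makes this deterministic
    return [rng.gauss(0, 1) for _ in range(dim)]


def toy_assay(variant):
    """Hypothetical stand-in for an experiment: activity peaks near a hidden optimum."""
    return -sum((x - 0.5) ** 2 for x in toy_embedding(variant))


def knn_predict(x, measured, k=3):
    """Predict activity as the mean activity of the k nearest measured embeddings."""
    nearest = sorted(measured, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(activity for _, activity in nearest) / len(nearest)


def active_learning(variants, assay, rounds=4, batch=5, seed=0):
    """Measure a small random batch, then repeatedly test the top-predicted variants."""
    rng = random.Random(seed)
    embeddings = {v: toy_embedding(v) for v in variants}
    # Initial round: a few randomly chosen variants provide the first labels.
    measured = {v: assay(v) for v in rng.sample(variants, batch)}
    for _ in range(rounds):
        train = [(embeddings[v], a) for v, a in measured.items()]
        pool = [v for v in variants if v not in measured]
        if not pool:
            break
        # Rank untested variants by predicted activity and assay the best ones.
        ranked = sorted(pool, key=lambda v: knn_predict(embeddings[v], train), reverse=True)
        for v in ranked[:batch]:
            measured[v] = assay(v)
    return measured


if __name__ == "__main__":
    variants = [f"V{i}" for i in range(100)]  # hypothetical variant names
    results = active_learning(variants, toy_assay)
    best = max(results, key=results.get)
    print(f"measured {len(results)} variants; best so far: {best}")
```

With the default settings the loop assays 25 of the 100 hypothetical variants, which mirrors the idea of learning from small amounts of experimental data rather than relying on zero-shot predictions alone.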