Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA.
Nabla Bio, Inc., Boston, MA, USA.
Nat Methods. 2021 Apr;18(4):389-396. doi: 10.1038/s41592-021-01100-y. Epub 2021 Apr 7.
Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two dissimilar proteins, GFP from Aequorea victoria (avGFP) and E. coli strain TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous high-throughput efforts. By distilling information from natural protein sequence landscapes, our model learns a latent representation of 'unnaturalness', which helps to guide search away from nonfunctional sequence neighborhoods. Subsequent low-N supervision then identifies improvements to the activity of interest. In sum, our approach enables efficient use of resource-intensive high-fidelity assays without sacrificing throughput, and helps to accelerate engineered proteins into the fermenter, field and clinic.
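The workflow summarized in the abstract (fit a supervised model on about 24 assayed variants, then search sequence space against the resulting virtual fitness landscape) can be illustrated with a minimal sketch. The abstract does not specify the model or the search procedure, so everything below is an assumption made for illustration: a flat one-hot encoding stands in for the learned representation, ridge regression serves as the low-N supervised model, and a Metropolis-style acceptance rule drives the in silico directed evolution. The function names, the toy wild-type string, and the synthetic fitness scores are hypothetical and are not taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq):
    """Assumed stand-in for a learned representation: flat one-hot encoding."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def predict(model, seq):
    """Predicted fitness of one sequence under the fitted model."""
    w, b = model
    return float(featurize(seq) @ w + b)

def mutate(seq, rng):
    """Propose a single random substitution (may leave the residue unchanged)."""
    pos = int(rng.integers(len(seq)))
    aa = AMINO_ACIDS[int(rng.integers(len(AMINO_ACIDS)))]
    return seq[:pos] + aa + seq[pos + 1:]

def toy_training_set(wild_type, n=24, seed=1):
    """Hypothetical low-N data: random double mutants with SYNTHETIC scores
    (fraction of hydrophobic residues), not real assay measurements."""
    rng = np.random.default_rng(seed)
    seqs = [mutate(mutate(wild_type, rng), rng) for _ in range(n)]
    scores = np.array([sum(aa in "AILMFVW" for aa in s) / len(s) for s in seqs])
    return seqs, scores

def evolve_in_silico(model, start, n_steps=500, temperature=0.05, seed=0):
    """Metropolis-style search over the virtual fitness landscape."""
    rng = np.random.default_rng(seed)
    current, current_fit = start, predict(model, start)
    best, best_fit = current, current_fit
    for _ in range(n_steps):
        proposal = mutate(current, rng)
        prop_fit = predict(model, proposal)
        # Always accept improvements; accept worse proposals with Boltzmann probability.
        accept = prop_fit >= current_fit or rng.random() < np.exp(
            (prop_fit - current_fit) / temperature
        )
        if accept:
            current, current_fit = proposal, prop_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
    return best, best_fit

if __name__ == "__main__":
    wild_type = "MKTAYIAKQR"  # hypothetical toy sequence
    seqs, scores = toy_training_set(wild_type)
    X = np.stack([featurize(s) for s in seqs])
    model = fit_ridge(X, scores)
    best, best_fit = evolve_in_silico(model, wild_type)
    print(best, round(best_fit, 3))
```

In this sketch the top candidates are only as trustworthy as the fitted landscape, which is why the abstract emphasizes a representation that steers search away from nonfunctional sequence neighborhoods; a one-hot encoding provides no such prior.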