Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125.
Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125.
Proc Natl Acad Sci U S A. 2024 Aug 6;121(32):e2400439121. doi: 10.1073/pnas.2400439121. Epub 2024 Jul 29.
Protein engineering often targets binding pockets or active sites, which are enriched in epistasis (nonadditive interactions between amino acid substitutions) and where the combined effects of multiple single substitutions are difficult to predict. Few existing sequence-fitness datasets capture epistasis at large scale, especially for enzyme catalysis, limiting the development and assessment of model-guided enzyme engineering approaches. We present here a combinatorially complete, 160,000-variant fitness landscape across four residues in the active site of an enzyme. Assaying the native reaction of a thermostable β-subunit of tryptophan synthase (TrpB) in a nonnative environment yielded a landscape characterized by significant epistasis and many local optima. These effects prevent simulated directed evolution approaches from efficiently reaching the global optimum. There is nonetheless wide variability in the effectiveness of different directed evolution approaches, which together provide experimental benchmarks for computational and machine learning workflows. The most-fit TrpB variants contain a substitution that is nearly absent in natural TrpB sequences, a result that conservation-based predictions would not capture. Thus, although fitness prediction using evolutionary data can enrich in more-active variants, these approaches struggle to identify and differentiate among the most-active variants, even for this near-native function. Overall, this work presents a large-scale testing ground for model-guided enzyme engineering and suggests that efficient navigation of epistatic fitness landscapes can be improved by advances in both machine learning and physical modeling.
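The terms used in the abstract can be made concrete with a minimal Python sketch. It enumerates a combinatorially complete four-site library (20^4 = 160,000 variants), measures pairwise epistasis as the deviation of a double mutant from the additive expectation, and runs a simple greedy single-substitution walk as one possible stand-in for simulated directed evolution. Everything here is an assumption made for illustration: the fitness values are random placeholders rather than the measured TrpB data, the starting sequence "AAAA" is hypothetical, and the function names (`toy_landscape`, `pairwise_epistasis`, `is_local_optimum`, `greedy_walk`) are not taken from the paper, whose actual simulation protocols may differ.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
N_SITES = 4

# Combinatorially complete library: 20**4 = 160,000 four-residue variants.
variants = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=N_SITES)]


def toy_landscape(seed=0):
    """Hypothetical placeholder fitness values; NOT the measured TrpB data."""
    rng = random.Random(seed)
    return {v: rng.random() for v in variants}


def mutate(seq, site, aa):
    """Return seq with the residue at position `site` replaced by `aa`."""
    return seq[:site] + aa + seq[site + 1:]


def pairwise_epistasis(fitness, parent, mut_a, mut_b):
    """Double-mutant fitness minus the additive expectation.

    mut_a and mut_b are (site, residue) pairs at two different sites.
    A value of zero means the two substitutions combine additively.
    """
    single_a = mutate(parent, *mut_a)
    single_b = mutate(parent, *mut_b)
    double = mutate(single_a, *mut_b)
    additive = fitness[single_a] + fitness[single_b] - fitness[parent]
    return fitness[double] - additive


def single_neighbors(seq):
    """Yield every variant that differs from seq at exactly one site."""
    for site in range(len(seq)):
        for aa in AMINO_ACIDS:
            if aa != seq[site]:
                yield mutate(seq, site, aa)


def is_local_optimum(fitness, seq):
    """True if no single-substitution neighbor is fitter than seq."""
    return all(fitness[n] <= fitness[seq] for n in single_neighbors(seq))


def greedy_walk(fitness, start):
    """Repeatedly move to the fittest single mutant until none improves."""
    current = start
    while True:
        best = max(single_neighbors(current), key=fitness.__getitem__)
        if fitness[best] <= fitness[current]:
            return current
        current = best


if __name__ == "__main__":
    fitness = toy_landscape()
    end = greedy_walk(fitness, "AAAA")  # hypothetical starting variant
    global_best = max(fitness, key=fitness.get)
    print(len(variants), end, end == global_best)
```

On a rugged landscape, a walk of this kind can stop at a local optimum that is not the global one, which is the failure mode the abstract attributes to epistasis.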