Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, 94305, USA; Stanford ChEM-H, Stanford University, Stanford, CA, 94305, USA.
Microsoft Research New England, Cambridge, MA, 02142, USA.
Curr Opin Struct Biol. 2022 Feb;72:145-152. doi: 10.1016/j.sbi.2021.11.002. Epub 2021 Dec 9.
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
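The train, optimize, measure loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the review: a toy four-letter alphabet, a ridge-regression surrogate on one-hot features, a hypothetical `toy_assay` function standing in for the experiment, and a greedy top-k selection rule that ignores model uncertainty.

```python
import itertools
import numpy as np

ALPHABET = "ACDE"   # toy alphabet (assumption); real proteins use 20 amino acids
LENGTH = 4          # toy sequence length (assumption)
TARGET = "EDCA"     # hidden optimum used only by the stand-in assay

def one_hot(seq):
    """Flattened one-hot encoding of a sequence."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, ALPHABET.index(aa)] = 1.0
    return x.ravel()

def toy_assay(seq):
    """Hypothetical stand-in for a lab measurement: matches to TARGET."""
    return float(sum(a == b for a, b in zip(seq, TARGET)))

def fit_ridge(seqs, ys, lam=1.0):
    """Fit a ridge-regression surrogate; returns (weights, mean of y)."""
    X = np.array([one_hot(s) for s in seqs])
    y = np.array(ys, dtype=float)
    mean = y.mean()
    A = X.T @ X + lam * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ (y - mean))
    return w, mean

def predict(model, seqs):
    """Predicted function for each sequence under the surrogate."""
    w, mean = model
    X = np.array([one_hot(s) for s in seqs])
    return X @ w + mean

def sequential_design(initial, rounds=3, batch=8):
    """Repeat: train surrogate, pick top-scoring unmeasured sequences, measure."""
    measured = {s: toy_assay(s) for s in initial}
    candidates = ["".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH)]
    for _ in range(rounds):
        model = fit_ridge(list(measured), list(measured.values()))
        pool = [s for s in candidates if s not in measured]
        scores = predict(model, pool)
        # Greedy selection: highest predicted sequences not yet measured.
        top = [pool[i] for i in np.argsort(-scores)[:batch]]
        for s in top:
            measured[s] = toy_assay(s)
    return measured

if __name__ == "__main__":
    start = ["AAAA", "CCCC", "DDDD", "EEEE", "ACDE", "CADE"]
    data = sequential_design(start, rounds=2, batch=5)
    best = max(data, key=data.get)
    print(best, data[best])
```

Setting `rounds=1` gives the single-round case. Enumerating every candidate is only feasible here because the toy space has 256 sequences; for realistic sequence lengths the pool would have to be searched or sampled instead.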