Qiu Yuchi, Hu Jian, Wei Guo-Wei
Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA.
Department of Chemistry, Michigan State University, MI 48824, USA.
Nat Comput Sci. 2021 Dec;1(12):809-818. doi: 10.1038/s43588-021-00168-y. Epub 2021 Dec 9.
Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e., fitness) through expensive and time-consuming screening or selection across a large mutational sequence space. Machine learning-assisted directed evolution (MLDE), which screens sequence properties in silico, can accelerate the optimization and reduce the experimental burden. This work introduces an MLDE framework, cluster learning-assisted directed evolution (CLADE), that combines hierarchical unsupervised clustering sampling with supervised learning to guide protein engineering. The clustering sampling selectively picks and screens variants in targeted subspaces, which guides the subsequent generation of diverse training sets. In the last stage, accurate predictions from supervised learning models improve the final outcomes. By sequentially screening 480 sequences out of 160,000 in a four-site combinatorial library, in five equal experimental batches, CLADE achieves global maximal fitness hit rates of up to 91.0% and 34.0% for the GB1 and PhoQ datasets, respectively, compared with 18.6% and 7.2% obtained by random-sampling-based MLDE.
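The screening budget and the batch-wise, cluster-guided sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `cluster_guided_sampling`, the weighting rule (mean measured fitness per cluster plus a small floor), the clustering by first-site residue, and the synthetic `toy_fitness` function are all assumptions made for illustration. Only the figures 20^4 = 160,000 variants, 480 screened sequences, and five equal batches come from the abstract. The supervised learning stage is left out; the returned dictionary would serve as its training set.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
N_SITES = 4          # four-site combinatorial library (from the abstract)
TOTAL_BUDGET = 480   # sequences screened in total (from the abstract)
N_BATCHES = 5        # equal experimental batches (from the abstract)

LIBRARY_SIZE = len(AMINO_ACIDS) ** N_SITES   # 20**4 = 160,000
BATCH_SIZE = TOTAL_BUDGET // N_BATCHES       # 480 / 5 = 96


def cluster_guided_sampling(clusters, measure, n_batches, batch_size, rng):
    """Screen variants batch by batch, favouring clusters that look fit so far.

    clusters: dict mapping a cluster id to a list of variants (hypothetical
              input; how the clusters are built is outside this sketch).
    measure:  callable returning the fitness of a variant (stands in for an
              experimental assay).
    """
    unscreened = {cid: list(members) for cid, members in clusters.items()}
    per_cluster = {cid: [] for cid in clusters}  # measured fitness per cluster
    screened = {}                                # variant -> measured fitness

    for _batch in range(n_batches):
        # Assumed weighting rule: mean measured fitness of each cluster, plus a
        # small floor so clusters with no measurements can still be explored.
        # Weights are fixed at the start of a batch, so results from a batch
        # only influence the batches that follow it.
        weights = {}
        for cid, values in per_cluster.items():
            mean = sum(values) / len(values) if values else 0.0
            weights[cid] = max(mean, 0.0) + 1e-3

        for _ in range(batch_size):
            candidates = [cid for cid in unscreened if unscreened[cid]]
            if not candidates:
                return screened
            cid = rng.choices(candidates, [weights[c] for c in candidates])[0]
            pool = unscreened[cid]
            variant = pool.pop(rng.randrange(len(pool)))
            fitness = measure(variant)
            screened[variant] = fitness
            per_cluster[cid].append(fitness)
    return screened


# Toy demo: cluster by the residue at the first site (an arbitrary stand-in
# for unsupervised clustering) and use a synthetic fitness function.
library = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=N_SITES)]
toy_clusters = {}
for variant in library:
    toy_clusters.setdefault(variant[0], []).append(variant)


def toy_fitness(variant):
    return variant.count("W") / N_SITES  # synthetic, not real assay data


training_set = cluster_guided_sampling(
    toy_clusters, toy_fitness, N_BATCHES, BATCH_SIZE, random.Random(0)
)
print(LIBRARY_SIZE, BATCH_SIZE, len(training_set))  # 160000 96 480
```

Under these assumptions, each batch of 96 variants shifts later sampling toward clusters whose screened members scored well, while the small weight floor keeps unexplored clusters reachable.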