Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, United States of America.
Quantitative Biosciences Institute, University of California, San Francisco, California, United States of America.
PLoS Comput Biol. 2024 Jul 11;20(7):e1011953. doi: 10.1371/journal.pcbi.1011953. eCollection 2024 Jul.
With recent methodological advances in computational protein design, in particular those based on deep learning, there is an increasing need for frameworks that allow coherent, direct integration of different models and objective functions into the generative design process. Here we demonstrate how evolutionary multiobjective optimization techniques can be adapted to provide such an approach. With the established Non-dominated Sorting Genetic Algorithm II (NSGA-II) as the optimization framework, we use AlphaFold2 and ProteinMPNN confidence metrics to define the objective space, and a mutation operator composed of ESM-1v and ProteinMPNN to rank and then redesign the least favorable positions. Using the two-state design problem of the fold-switching protein RfaH as an in-depth case study, and PapD and calmodulin as examples of higher-dimensional design problems, we show that the evolutionary multiobjective optimization approach leads to a significant reduction in the bias and variance of RfaH native sequence recovery, compared to a direct application of ProteinMPNN. We suggest that this improvement is due to three factors: (i) the use of an informative mutation operator that accelerates sequence space exploration, (ii) the parallel, iterative design process inherent to the genetic algorithm, which improves upon the ProteinMPNN autoregressive sequence decoding scheme, and (iii) the explicit approximation of the Pareto front, which yields optimal design candidates representing diverse tradeoff conditions. We anticipate that this approach will be readily adaptable to different models and broadly relevant for protein design tasks with complex specifications.
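As an illustration of the Pareto front approximation mentioned in point (iii), the following is a minimal sketch of non-dominated sorting, the ranking step that gives NSGA-II its name. It is not the authors' implementation: the objective values are hypothetical placeholders (in the paper they would derive from AlphaFold2 and ProteinMPNN confidence metrics), both objectives are assumed to be minimized, and a simple quadratic-time peeling loop stands in for the bookkeeping used by the fast non-dominated sort in NSGA-II.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points: Sequence[Sequence[float]]) -> List[List[int]]:
    """Group candidate indices into successive non-dominated fronts.
    Front 0 is the current approximation of the Pareto front."""
    remaining = set(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        # A candidate belongs to the current front if no other
        # remaining candidate dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical scores for five designs: (objective for state A, objective for state B).
# Lower is better for both; the values are made up for illustration.
scores = [(0.2, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6), (0.9, 0.9)]
fronts = non_dominated_sort(scores)
print(fronts)  # [[0, 1, 2], [3], [4]]
```

In this toy example, designs 0, 1, and 2 each represent a different tradeoff between the two states and none is strictly better than another, so all three sit on the first front; design 3 is dominated by design 1, and design 4 is dominated by every other design.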