Luo Jiaqi, Ding Kerr, Luo Yunan
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30308, USA.
iScience. 2025 Feb 27;28(3):112119. doi: 10.1016/j.isci.2025.112119. eCollection 2025 Mar 21.
Supervised machine learning (ML) has significantly advanced sequence-based protein property prediction. However, the inverse problem, designing protein sequences with desired properties, remains under-explored. The challenges in sequence design stem from the vast search space and the rugged protein fitness landscape. In this work, we present MosPro, an efficient ML algorithm for property-guided protein sequence design. We frame sequence design as a discrete sampling problem. Using a pre-trained differentiable ML model that predicts the properties of sequences, MosPro constructs a distribution that assigns high probability mass to regions of sequence space containing high-property sequences. To generate designs, MosPro efficiently samples sequences from this distribution. We further develop a Pareto optimization algorithm to propose sequences that are simultaneously optimized for multiple properties. Evaluations on experimental fitness landscapes demonstrate that MosPro generates sequences that optimally trade off multiple desiderata. Our results suggest an unparalleled potential of generative ML for the efficient and controllable design of functional proteins.
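To make the sampling view concrete, the following is a minimal sketch and not the MosPro algorithm itself. It assumes a target distribution of the form p(x) ∝ exp(f(x)/T) over sequences and draws from it with a plain Metropolis sampler that proposes uniform single-site mutations, whereas the abstract describes a sampler built on a pre-trained differentiable predictor. The scoring functions `toy_property_a` and `toy_property_b`, the temperature, the starting sequence, and all other names are hypothetical stand-ins chosen for illustration. A simple non-dominated filter then illustrates what a Pareto trade-off between two properties means.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_property_a(seq):
    """Hypothetical stand-in for a learned predictor: hydrophobic fraction."""
    return sum(c in "AILMFVW" for c in seq) / len(seq)


def toy_property_b(seq):
    """Second hypothetical property: charged-residue fraction."""
    return sum(c in "DEKR" for c in seq) / len(seq)


def metropolis_sample(start, score, n_steps=500, temperature=0.05, seed=0):
    """Sample from p(x) proportional to exp(score(x) / temperature).

    Proposals are uniform single-site substitutions, which are symmetric,
    so the acceptance probability is min(1, exp((s_new - s_old) / T)).
    Returns the sorted set of distinct sequences visited by the chain.
    """
    rng = random.Random(seed)
    current, current_score = start, score(start)
    visited = {current}
    for _ in range(n_steps):
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(ALPHABET) + current[pos + 1:]
        proposal_score = score(proposal)
        delta = (proposal_score - current_score) / temperature
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_score = proposal, proposal_score
            visited.add(current)
    return sorted(visited)


def pareto_front(seqs, objectives):
    """Keep sequences that no other sequence dominates (higher is better)."""
    scored = [(s, tuple(f(s) for f in objectives)) for s in seqs]
    front = []
    for s, v in scored:
        dominated = any(
            all(wi >= vi for wi, vi in zip(w, v))
            and any(wi > vi for wi, vi in zip(w, v))
            for _, w in scored
        )
        if not dominated:
            front.append(s)
    return front


# Usage: sample under the summed score, then keep the non-dominated designs.
start = "GSGSGSGSGSGS"  # hypothetical starting sequence
samples = metropolis_sample(start, lambda s: toy_property_a(s) + toy_property_b(s))
front = pareto_front(samples, [toy_property_a, toy_property_b])
print(len(samples), "distinct sequences visited;", len(front), "on the Pareto front")
```

Summing the two scores is only one way to combine objectives, and it is used here to keep the sketch short. The non-dominated filter is applied after sampling, so it shows the trade-off among visited sequences and does not steer the sampler toward the front.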