Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany.
Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany.
J Chem Inf Model. 2024 Aug 26;64(16):6350-6360. doi: 10.1021/acs.jcim.4c00704. Epub 2024 Aug 1.
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
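The low-data setting described in the abstract (training on roughly 50 to 500 labeled sequences, then predicting fitness values and ranks) can be illustrated with a minimal sketch. This is not the MERGE implementation: the DCA-based encoding is replaced here by a plain one-hot encoding, the regressor is a closed-form ridge regression, and the data are synthetic. All function names (`one_hot`, `fit_ridge`, `spearman`, `demo`) and parameter values are assumptions made for illustration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seqs):
    """Encode equal-length sequences as flat one-hot vectors.

    Assumption: this stands in for a DCA-based sequence encoding,
    which is not reproduced here.
    """
    length = len(seqs[0])
    X = np.zeros((len(seqs), length * len(AMINO_ACIDS)))
    for n, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            X[n, pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return X


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def ranks(v):
    """Ordinal ranks (ties are not averaged, which is fine for continuous values)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r


def spearman(a, b):
    """Spearman rank correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


def demo(n_train=100, n_test=200, length=10, seed=0):
    """Train on a small labeled set and score rank prediction on held-out sequences."""
    rng = np.random.default_rng(seed)
    # Synthetic additive fitness landscape (hypothetical, for illustration only).
    true_w = rng.normal(size=length * len(AMINO_ACIDS))
    seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=length))
            for _ in range(n_train + n_test)]
    X = one_hot(seqs)
    y = X @ true_w + rng.normal(scale=0.1, size=len(seqs))
    w, b = fit_ridge(X[:n_train], y[:n_train])
    pred = X[n_train:] @ w + b
    return spearman(pred, y[n_train:])


if __name__ == "__main__":
    print(f"Spearman rho on held-out sequences: {demo():.2f}")
```

Spearman correlation is used because the abstract emphasizes predicting a variant's rank as well as its fitness value. For an engineering campaign, ordering candidates correctly often matters more than matching absolute values.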