Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort.

Affiliations

Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany.

Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany.

Publication information

J Chem Inf Model. 2024 Aug 26;64(16):6350-6360. doi: 10.1021/acs.jcim.4c00704. Epub 2024 Aug 1.

Abstract

Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
