Data-driven protease engineering by DNA-recording and epistasis-aware machine learning.

Authors

Huber Lukas, Kucera Tim, Höllerer Simon, Borgwardt Karsten, Panke Sven, Jeschek Markus

Affiliations

Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland.

Swiss Institute of Bioinformatics, Basel, Switzerland.

Publication

Nat Commun. 2025 Jul 1;16(1):5466. doi: 10.1038/s41467-025-60622-7.

Abstract

Protein engineering has recently been transformed by machine learning (ML) tools that predict structure from sequence with unprecedented precision. Predicting catalytic activity, however, remains challenging, restricting our ability to design protein sequences with desired catalytic function in silico. This predicament is mainly rooted in a lack of experimental methods capable of recording sequence-activity data in quantities sufficient for data-intensive ML techniques, and in the inefficiency of searches in the enormous sequence spaces inherent to proteins. Herein, we address both limitations in the context of engineering proteases with tailored substrate specificity. We introduce a DNA recorder for deep specificity profiling of proteases in Escherichia coli, which we demonstrate by testing 29,716 candidate proteases against up to 134 substrates in parallel. The resulting sequence-activity data on approximately 600,000 protease-substrate pairs not only reveal key sequence determinants governing protease specificity, but also allow us to build a data-efficient deep learning model that accurately predicts protease sequences with desired on- and off-target activities. Moreover, we present epistasis-aware training set design as a generalizable strategy to streamline searches within enormous sequence spaces; it strongly increases model accuracy for a given experimental effort and is thus likely to have implications for protein engineering far beyond proteases.
