Transfer learning for cross-context prediction of protein expression from 5'UTR sequence.

Affiliations

School of Biological Sciences, University of Bristol, 24 Tyndall Avenue, Bristol BS8 1TQ, UK.

BrisEngBio, School of Chemistry, University of Bristol, Cantock's Close, Bristol BS8 1TS, UK.

Publication information

Nucleic Acids Res. 2024 Jul 22;52(13):e58. doi: 10.1093/nar/gkae491.

DOI: 10.1093/nar/gkae491

PMID: 38864396

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11260469/

Abstract

Model-guided DNA sequence design can accelerate the reprogramming of living cells. It allows us to engineer more complex biological systems by removing the need to physically assemble and test each potential design. While mechanistic models of gene expression have seen some success in supporting this goal, data-centric, deep learning-based approaches often provide more accurate predictions. This accuracy, however, comes at a cost: a lack of generalization across genetic and experimental contexts that has limited their wider use outside the context in which they were trained. Here, we address this issue by demonstrating how a simple transfer learning procedure can effectively tune a pre-trained deep learning model to predict protein translation rate from 5' untranslated region (5'UTR) sequence for diverse contexts in Escherichia coli using a small number of new measurements. This allows for important model features learnt from expensive massively parallel reporter assays to be easily transferred to new settings. By releasing our trained deep learning model and complementary calibration procedure, this study acts as a starting point for continually refined model-based sequence design that builds on previous knowledge and future experimental efforts.
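The general idea of tuning a pre-trained model with a few new measurements can be illustrated with a minimal sketch. This is not the authors' released model or calibration procedure. It assumes one common form of transfer learning, in which the learnt feature layers are kept frozen and only a new linear output layer is refit on the small dataset from the new context. The class `FrozenFeatureExtractor`, the functions `fit_head` and `predict`, the random stand-in weights, and the synthetic sequences and measurements are all hypothetical and serve only to make the example runnable.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq: str) -> np.ndarray:
    """Flattened one-hot encoding of a DNA sequence (length * 4 values)."""
    idx = np.array([BASES.index(base) for base in seq])
    return np.eye(4)[idx].ravel()


class FrozenFeatureExtractor:
    """Hypothetical stand-in for the hidden layers of a pre-trained model.

    The weights here are random placeholders (an assumption for illustration);
    in practice they would be loaded from a trained model. They are never
    updated during calibration.
    """

    def __init__(self, seq_len: int, n_features: int = 32, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(seq_len * 4, n_features)) / np.sqrt(seq_len)
        self.b = rng.normal(size=n_features)

    def __call__(self, seqs):
        X = np.stack([one_hot(s) for s in seqs])
        return np.tanh(X @ self.W + self.b)


def _with_bias(features: np.ndarray) -> np.ndarray:
    """Append a constant column so the linear head has an intercept."""
    return np.hstack([features, np.ones((len(features), 1))])


def fit_head(features: np.ndarray, y: np.ndarray, ridge: float = 1e-2) -> np.ndarray:
    """Refit only the linear output layer by closed-form ridge regression."""
    F = _with_bias(features)
    A = F.T @ F + ridge * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ y)


def predict(features: np.ndarray, head: np.ndarray) -> np.ndarray:
    """Apply the refit linear head to frozen features."""
    return _with_bias(features) @ head


# Synthetic "new context": 20 random 30-nt sequences with made-up measurements.
rng = np.random.default_rng(1)
seq_len, n_new = 30, 20
seqs = ["".join(rng.choice(list(BASES), size=seq_len)) for _ in range(n_new)]
y_new = rng.normal(size=n_new)  # placeholder for measured expression values

extractor = FrozenFeatureExtractor(seq_len)
W_before = extractor.W.copy()

features = extractor(seqs)        # frozen features, no weight updates
head = fit_head(features, y_new)  # only the output layer is refit
preds = predict(features, head)
```

Refitting only the output layer keeps the number of free parameters small, which suits a setting with few new measurements. Fine-tuning deeper layers at a low learning rate is another option, at a higher risk of overfitting.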
