School of Computing, National University of Singapore, Singapore 117417, Singapore.
Science Center for Future Foods, Jiangnan University, Wuxi 214122, PR China.
ACS Synth Biol. 2024 Sep 20;13(9):2960-2968. doi: 10.1021/acssynbio.4c00371. Epub 2024 Sep 4.
The N-terminal coding sequence (NCS) influences gene expression by affecting the translation initiation rate. The NCS optimization problem is to find an NCS that maximizes gene expression, a problem of practical importance in genetic engineering. However, current methods for NCS optimization, such as rational design and statistics-guided approaches, are labor-intensive and yield only relatively small improvements. This paper introduces a few-shot training workflow for NCS optimization, codesigned across deep learning and synthetic biology. Our method uses k-nearest encoding followed by word2vec to encode the NCS, extracts features with attention mechanisms, builds a time-series network to predict gene expression intensity, and finally applies a direct search algorithm to identify the optimal NCS with limited training data. We used green fluorescent protein (GFP) as the reporter protein for NCSs and the fluorescence enhancement factor as the metric of NCS optimization. Within just six iterative experiments, our model generated an NCS (MLD) that increased average GFP expression by 5.41-fold, outperforming state-of-the-art NCS designs. Extending our findings beyond GFP, we showed that the engineered NCS (MLD) can effectively boost the production of N-acetylneuraminic acid by enhancing the expression of the crucial rate-limiting gene, demonstrating its practical utility. We have open-sourced our NCS expression database and experimental procedures for public use.
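As an illustration only, the encode, score, and search steps described above can be sketched in Python. This is not the authors' implementation. Overlapping k-mer tokenization is assumed here as a stand-in for the encoding step, `toy_score` is a hypothetical placeholder for the trained expression predictor, and the greedy single-base substitution loop is just one simple form of direct search. All function names are invented for this sketch.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers.

    Assumption: tokens like these could be passed to a word2vec model;
    the paper's exact encoding is not reproduced here.
    """
    seq = seq.upper()
    if k <= 0 or k > len(seq):
        raise ValueError("k must be between 1 and len(seq)")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def enhancement_factor(variant: float, control: float) -> float:
    """Fluorescence of an NCS variant relative to the control construct."""
    if control <= 0:
        raise ValueError("control fluorescence must be positive")
    return variant / control


def direct_search(
    start: str,
    score: Callable[[str], float],
    fixed_prefix: int = 3,
    max_rounds: int = 10,
) -> Tuple[str, float]:
    """Greedy single-base substitution search.

    The first `fixed_prefix` bases (e.g. the start codon) are left unchanged.
    `score` stands in for a trained predictor of expression intensity.
    """
    best = start.upper()
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for pos in range(fixed_prefix, len(best)):
            for base in BASES:
                if base == best[pos]:
                    continue
                candidate = best[:pos] + base + best[pos + 1:]
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best, best_score = candidate, candidate_score
                    improved = True
        if not improved:
            break  # no single substitution improves the score
    return best, best_score


def toy_score(seq: str) -> float:
    """Hypothetical placeholder predictor: fraction of A/T bases."""
    return sum(b in "AT" for b in seq) / len(seq)


if __name__ == "__main__":
    print(kmer_tokens("ATGAAA", k=3))
    print(direct_search("ATGGCC", toy_score))
```

In a few-shot setting, `toy_score` would be replaced by a model retrained after each experimental round, and the measured enhancement factors would be added to the training set.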