Department of Chemical Engineering, Pohang University of Science and Technology (POSTECH), 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673, Korea.
School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), 50 UNIST-Gil, Eonyang-Eup, Ulsan 44919, Korea.
Nucleic Acids Res. 2023 Jul 21;51(13):7071-7082. doi: 10.1093/nar/gkad451.
Deep generative models, which can approximate complex data distributions from large datasets, are widely used in biological dataset analysis. In particular, they can identify and unravel hidden traits encoded within a complicated nucleotide sequence, allowing us to design genetic parts with accuracy. Here, we provide a generic deep learning-based framework to design and evaluate synthetic promoters for cyanobacteria using generative models, which we in turn validated with a cell-free transcription assay. We developed a deep generative model based on a variational autoencoder and a predictive model based on a convolutional neural network. Using native promoter sequences of the model unicellular cyanobacterium Synechocystis sp. PCC 6803 as a training dataset, we generated 10 000 synthetic promoter sequences and predicted their strengths. By position weight matrix and k-mer analyses, we confirmed that our model captured valid features of cyanobacterial promoters from the dataset. Furthermore, critical subregion identification analysis consistently revealed the importance of the -10 box sequence motif in cyanobacterial promoters. Moreover, we validated in a cell-free transcription assay that a generated promoter sequence can efficiently drive transcription. This approach, combining in silico and in vitro studies, will provide a foundation for the rapid design and validation of synthetic promoters, especially for non-model organisms.
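The position weight matrix and k-mer analyses mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' actual pipeline, so everything below is an assumption: the function names (`kmer_frequencies`, `position_weight_matrix`, `consensus`), the pseudocount, the uniform background frequency of 0.25, and the toy 6-mers, which only loosely resemble a -10 box and are not data from the study.

```python
from collections import Counter
import math

BASES = "ACGT"

def kmer_frequencies(seqs, k):
    """Relative frequency of each k-mer pooled over all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def position_weight_matrix(seqs, pseudocount=1.0, background=0.25):
    """Log2-odds PWM for equal-length sequences (one dict per position).

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    denom = len(seqs) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = Counter(s[pos] for s in seqs)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / denom / background)
            for b in BASES
        })
    return pwm

def consensus(pwm):
    """Highest-scoring base at each position."""
    return "".join(max(col, key=col.get) for col in pwm)

# Hypothetical toy sequences, not taken from the paper.
native = ["TATAAT", "TATACT", "TAGAAT", "TATAAT"]
generated = ["TATAAT", "TACAAT", "TATGAT", "TATAAT"]

native_pwm = position_weight_matrix(native)
generated_pwm = position_weight_matrix(generated)
native_3mers = kmer_frequencies(native, 3)
generated_3mers = kmer_frequencies(generated, 3)
```

Comparing `consensus(native_pwm)` with `consensus(generated_pwm)`, or the two 3-mer frequency tables, gives a rough check of whether generated sequences preserve the motif content of the training set. A real analysis would use full-length promoters and an organism-specific background.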