Institutes of Physical Science and Information Technology, Anhui University, Hefei, China.
Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China.
Methods. 2022 Aug;204:38-46. doi: 10.1016/j.ymeth.2022.03.017. Epub 2022 Mar 31.
A promoter is a key DNA element located near the transcription start site that regulates gene transcription by binding RNA polymerase. The identification of promoters is therefore an important research topic in synthetic biology. Nannochloropsis is an important unicellular, industrial oleaginous microalga. Several studies have identified Nannochloropsis promoters with specific functions by biological methods, whereas few have used computational methods. Here, we propose DNPPro (DenseNet-Predict-Promoter), a method based on densely connected convolutional neural networks, to predict promoters in Nannochloropsis. First, we collected promoter sequences from six Nannochloropsis strains and, for each strain, used CD-HIT to remove redundant sequences at an 80% similarity threshold, yielding a reliable positive dataset. Then, to construct a robust classifier, we generated the negative dataset with a within-group scrambling method, which overcomes the limitation of randomly selecting non-promoter regions from the same genome as negative samples. Finally, we constructed a densely connected convolutional neural network that takes the one-hot encoding of the sequence as input. Compared with commonly used sequence processing methods, DNPPro extracts long-sequence features to a greater extent. A cross-strain experiment on an independent dataset verifies the generalization ability of our method, and t-SNE visualization shows that it can effectively distinguish promoters from non-promoters.
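The two data-preparation ideas named in the abstract, one-hot encoding of the input sequence and negative-sample generation by within-group scrambling, can be illustrated with a minimal sketch. The abstract does not specify the scrambling procedure, so the code below assumes one plausible reading: the sequence is split into fixed-size groups and the nucleotides are shuffled inside each group. The function names, the `group_size` value, and the A/C/G/T channel order are hypothetical choices for illustration, not details taken from DNPPro.

```python
import random

ALPHABET = "ACGT"  # assumed channel order; the paper's order is not stated


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in ALPHABET:  # ambiguous bases such as N stay all-zero
            row[ALPHABET.index(base)] = 1
        rows.append(row)
    return rows


def scramble_within_groups(seq, group_size=10, seed=None):
    """Shuffle nucleotides inside consecutive fixed-size groups.

    Assumed interpretation of "within-group scrambling": each group keeps
    its base composition, but the order of bases inside it is randomized.
    """
    rng = random.Random(seed)
    pieces = []
    for start in range(0, len(seq), group_size):
        group = list(seq[start:start + group_size])
        rng.shuffle(group)
        pieces.append("".join(group))
    return "".join(pieces)


if __name__ == "__main__":
    promoter = "TATAAAGGCCGTACGTTAGC"  # made-up toy sequence
    negative = scramble_within_groups(promoter, group_size=5, seed=1)
    print(promoter, negative, len(one_hot(negative)))
```

Under this reading, a scrambled negative keeps the length and local base composition of its source promoter while disrupting the ordered motifs, which gives the classifier a harder contrast than an arbitrary non-promoter region would.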