Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks.

Authors

Umarov Ramzan Kh, Solovyev Victor V

Affiliations

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.

Softberry Inc., Mount Kisco, United States of America.

Publication information

PLoS One. 2017 Feb 3;12(2):e0171410. doi: 10.1371/journal.pone.0171410. eCollection 2017.

DOI:10.1371/journal.pone.0171410

PMID:28158264

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5291440/

Abstract

Accurate computational identification of promoters remains a challenge because these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we use Convolutional Neural Networks (CNNs) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build predictive models for them. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). A CNN trained on the sigma70 subclass of Escherichia coli promoters gives an excellent classification of promoter and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoter identification model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse, and Arabidopsis promoters we used CNNs to identify two well-known promoter classes (TATA and non-TATA promoters). The CNN models recognize these complex functional regions well. For human promoters, Sn/Sp/CC reached 0.95/0.98/0.90 for TATA and 0.90/0.98/0.89 for non-TATA promoter sequences. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA). Thus, the developed CNN models, implemented in the CNNProm program, demonstrate the ability of a deep learning approach to capture complex promoter sequence characteristics and achieve significantly higher accuracy than previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. Because the suggested approach does not require knowledge of any specific promoter features, it can easily be extended to identify promoters and other complex functional regions in the sequences of many other genomes, especially newly sequenced ones.
The CNNProm program can be run on the web server at http://www.softberry.com.
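
The abstract reports Sn, Sp, and CC without defining them. The sketch below assumes the usual reading in promoter-prediction work: Sn is sensitivity, Sp is specificity, and CC is the Matthews correlation coefficient, all computed from confusion-matrix counts. The function name `sn_sp_cc` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
import math


def sn_sp_cc(tp, fn, tn, fp):
    """Return (Sn, Sp, CC) from confusion-matrix counts.

    Assumed definitions (not stated in the abstract):
      Sn = TP / (TP + FN)   sensitivity: fraction of promoters recovered
      Sp = TN / (TN + FP)   specificity: fraction of non-promoters rejected
      CC = Matthews correlation coefficient
    Both classes are assumed to be non-empty.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # CC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention.
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical counts for a balanced test set of 100 promoters
# and 100 non-promoter sequences.
sn, sp, cc = sn_sp_cc(tp=90, fn=10, tn=96, fp=4)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

CC depends on the ratio of positive to negative examples as well as on Sn and Sp, so the same Sn/Sp pair can yield different CC values on test sets with different class balance.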
