Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks.

Authors

Umarov Ramzan Kh, Solovyev Victor V

Affiliations

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.

Softberry Inc., Mount Kisco, United States of America.

Publication information

PLoS One. 2017 Feb 3;12(2):e0171410. doi: 10.1371/journal.pone.0171410. eCollection 2017.

Abstract

Accurate computational identification of promoters remains a challenge because these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we use convolutional neural networks (CNNs) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build predictive models for them. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). A CNN trained on the sigma70 subclass of Escherichia coli promoters classifies promoter and non-promoter sequences with sensitivity Sn = 0.90, specificity Sp = 0.96, and correlation coefficient CC = 0.84. The Bacillus subtilis promoter identification model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse, and Arabidopsis promoters we used CNNs to identify two well-known promoter classes (TATA and non-TATA promoters). The CNN models recognize these complex functional regions well. For human promoters, the Sn/Sp/CC of prediction reached 0.95/0.98/0.90 for TATA and 0.90/0.98/0.89 for non-TATA promoter sequences. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA). Thus, the developed CNN models, implemented in the CNNProm program, demonstrate the ability of the deep learning approach to capture complex promoter sequence characteristics and achieve significantly higher accuracy than previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. Because the suggested approach does not require knowledge of any specific promoter features, it can easily be extended to identify promoters and other complex functional regions in the sequences of many other genomes, especially newly sequenced ones.
The CNNProm program is available to run on the web server at http://www.softberry.com.
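
The abstract reports each model's accuracy as Sn, Sp, and CC. As a minimal sketch, assuming these follow the standard definitions (sensitivity, specificity, and the Matthews correlation coefficient), the three values can be computed from confusion-matrix counts as shown here. The function name and the example counts are hypothetical, chosen only for illustration; they are not taken from the paper or from CNNProm.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from confusion-matrix counts.

    Assumed standard definitions (not confirmed by the abstract itself):
      Sn = TP / (TP + FN)   fraction of true promoters that are recovered
      Sp = TN / (TN + FP)   fraction of non-promoters that are rejected
      CC = Matthews correlation coefficient
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when CC is undefined (an empty row or column in the matrix).
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical counts for illustration only (not data from the paper).
sn, sp, cc = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Unlike Sn and Sp, CC depends on the ratio of promoter to non-promoter sequences in the test set, so two models with identical Sn and Sp can report different CC values.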


