Umarov Ramzan Kh, Solovyev Victor V
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
Softberry Inc., Mount Kisco, United States of America.
PLoS One. 2017 Feb 3;12(2):e0171410. doi: 10.1371/journal.pone.0171410. eCollection 2017.
Accurate computational identification of promoters remains a challenge because these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we use convolutional neural networks (CNNs) to analyze the sequence characteristics of prokaryotic and eukaryotic promoters and to build predictive models for them. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). A CNN trained on the sigma70 subclass of Escherichia coli promoters classifies promoter and non-promoter sequences with sensitivity Sn = 0.90, specificity Sp = 0.96, and correlation coefficient CC = 0.84. The CNN model for Bacillus subtilis promoters achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse, and Arabidopsis promoters we trained CNNs to identify two well-known promoter classes (TATA and non-TATA promoters), and the models recognize these complex functional regions well. For human promoters, Sn/Sp/CC reached 0.95/0.98/0.90 on TATA and 0.90/0.98/0.89 on non-TATA promoter sequences. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA). Thus the CNN models, implemented in the CNNProm program, demonstrate the ability of the deep learning approach to capture complex promoter sequence characteristics and to achieve significantly higher accuracy than previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. Because the suggested approach does not require knowledge of any specific promoter features, it can easily be extended to identify promoters and other complex functional regions in the sequences of many other genomes, especially newly sequenced ones.
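The Sn, Sp, and CC values quoted in the abstract can be computed from a confusion matrix. The sketch below is illustrative only: it assumes that CC denotes the Matthews correlation coefficient, which the abstract does not state explicitly, and the function name `sn_sp_cc` and the counts are hypothetical, not taken from the paper.

```python
import math

def sn_sp_cc(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Assumption: CC is the Matthews correlation coefficient.
    """
    # Sensitivity: fraction of true promoters that were predicted as promoters
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-promoters that were predicted as non-promoters
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc

# Illustrative counts only (hypothetical, not data from the paper)
sn, sp, cc = sn_sp_cc(tp=90, tn=96, fp=4, fn=10)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Unlike Sn and Sp, CC depends on the ratio of positive to negative examples in the test set, so the same Sn/Sp pair can correspond to different CC values.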
The CNNProm program can be run on the web server at http://www.softberry.com.
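The abstract mentions a random substitution procedure for finding positionally conserved functional elements but gives no details. One plausible reading, sketched below under stated assumptions, is to replace a sliding window of the sequence with random nucleotides and record how much the model score drops. Everything here is hypothetical: `toy_score` is a stand-in for a trained classifier (it is not CNNProm), and the window size, trial count, and example sequence are arbitrary.

```python
import random

ALPHABET = "ACGT"

def toy_score(seq):
    """Hypothetical stand-in for a trained model:
    scores 1.0 if the motif TATAAT sits at offset 10, else 0.0."""
    return 1.0 if seq[10:16] == "TATAAT" else 0.0

def substitution_profile(seq, score_fn, window=6, trials=20, seed=0):
    """For each window start, return the mean score drop observed after
    replacing that window with random nucleotides."""
    rng = random.Random(seed)
    base = score_fn(seq)
    drops = []
    for start in range(len(seq) - window + 1):
        total = 0.0
        for _ in range(trials):
            rand = "".join(rng.choice(ALPHABET) for _ in range(window))
            mutated = seq[:start] + rand + seq[start + window:]
            total += base - score_fn(mutated)
        drops.append(total / trials)
    return drops

# Hypothetical 26-nt sequence with a motif at positions 10..15
seq = "GCGCGCGCGC" + "TATAAT" + "GCGCGCGCGC"
profile = substitution_profile(seq, toy_score)
for start, drop in enumerate(profile):
    print(start, round(drop, 2))
```

Windows that overlap the position the scorer depends on show large drops, while windows elsewhere show none, which is how such a profile could point to positionally conserved elements.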