Kim Jeehong, Shujaat Muhammad, Tayara Hilal
Department of New & Renewable Energy, VISION College of Jeonju, Jeonju 55069, Republic of Korea.
Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea.
Genomics. 2022 May;114(3):110384. doi: 10.1016/j.ygeno.2022.110384. Epub 2022 May 6.
A promoter is a short DNA sequence near the start codon that initiates the transcription of a specific gene in the genome. Accurate recognition of promoters is important for a better understanding of transcriptional regulation. Because of this role in transcriptional regulation, there is an urgent need for in silico tools that identify promoters and their types quickly and accurately. A number of prediction methods have been developed for this purpose; however, almost all of them are limited to identifying promoters and their strength or sigma types. The TATA box region in TATA promoters influences post-transcriptional processes; therefore, in the current study, we developed a two-layer predictor called "iProm-Zea" that uses a convolutional neural network (CNN) to identify TATA and TATA-less promoters. The first layer identifies a given DNA sequence as a promoter or non-promoter. The second layer identifies whether a recognized promoter is a TATA promoter. To find an optimal feature encoding scheme and model, we evaluated four feature encoding schemes with different machine learning and CNN algorithms, and based on the evaluation results, we selected a one-hot encoding scheme and a CNN model for iProm-Zea. The 5-fold cross-validation results demonstrated that the constructed predictor shows great potential for identifying promoters and classifying them as TATA or TATA-less promoters. Furthermore, we performed a cross-species analysis of iProm-Zea to evaluate its performance in other species. To make it easier for experimental scientists to obtain the results they need, we established a freely accessible and user-friendly web server at http://nsclbio.jbnu.ac.kr/tools/iProm-Zea/.
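The one-hot encoding and the two-layer decision flow described in the abstract can be illustrated with a minimal Python sketch. This is not the iProm-Zea implementation: the function names `one_hot_encode` and `two_layer_predict`, the A/C/G/T column order, the all-zero handling of ambiguous bases, and the 0.5 threshold are all assumptions made for illustration, and the two models are passed in as plain callables standing in for the trained CNNs.

```python
from typing import Callable, List

# Assumed column order for the encoding; the paper's actual order is not stated here.
NUCLEOTIDES = "ACGT"

Matrix = List[List[int]]
# Hypothetical model interface: takes an encoded sequence, returns a score in [0, 1].
Classifier = Callable[[Matrix], float]


def one_hot_encode(sequence: str) -> Matrix:
    """Encode a DNA sequence as an L x 4 binary matrix.

    Bases outside A/C/G/T (for example N) become all-zero rows,
    which is an assumption of this sketch.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES]
            for base in sequence.upper()]


def two_layer_predict(sequence: str,
                      promoter_model: Classifier,
                      tata_model: Classifier,
                      threshold: float = 0.5) -> str:
    """Cascade two binary classifiers, mirroring the two-layer design.

    Layer 1 decides promoter vs. non-promoter; layer 2 is consulted
    only for sequences that layer 1 accepts.
    """
    encoded = one_hot_encode(sequence)
    if promoter_model(encoded) < threshold:
        return "non-promoter"
    if tata_model(encoded) >= threshold:
        return "TATA promoter"
    return "TATA-less promoter"


if __name__ == "__main__":
    # Placeholder scorers standing in for trained CNNs (purely illustrative).
    always_promoter = lambda encoded: 0.9
    never_tata = lambda encoded: 0.1
    print(one_hot_encode("ACGT"))
    print(two_layer_predict("GCTATAAAGG", always_promoter, never_tata))
```

In this arrangement the second classifier never sees sequences rejected by the first, so an error in layer 1 propagates to the final label; in a real pipeline each callable would wrap a trained network's prediction call.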