School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan.
Methods. 2024 Oct;230:119-128. doi: 10.1016/j.ymeth.2024.08.005. Epub 2024 Aug 19.
Promoters are short DNA regions (50-1500 base pairs) that play a critical role in regulating gene transcription. Many serious diseases, such as cancer, cardiovascular disease, and inflammatory bowel disease, are caused by genetic variations in promoters. Consequently, the accurate identification and characterization of promoters is important for drug discovery. However, experimental approaches to recognizing promoters and their strengths are costly, time-consuming, and resource-intensive. Computational techniques are therefore highly desirable for characterizing promoters in unannotated genomic data. Here, we designed a bi-layer deep-learning-based predictor named "PROCABLES", which identifies DNA samples as promoters in the first phase and classifies promoters as strong or weak in the second phase. The proposed method uses five distinct feature types (word2vec, k-spaced nucleotide pairs, trinucleotide propensity-based features, trinucleotide composition, and electron-ion interaction pseudopotentials) to extract hidden patterns from the DNA sequence. A stacked framework is then formed by integrating a convolutional neural network (CNN) with a bidirectional long short-term memory (LSTM) network, and the model is trained on these multi-view features. Under ten-fold cross-validation, PROCABLES achieved accuracies of 0.971 and 0.920 and Matthews correlation coefficient (MCC) values of 0.940 and 0.840 for the first and second layers, respectively. These results indicate that PROCABLES outperforms existing state-of-the-art computational predictors of promoters and their types. In summary, this research provides useful guidance for large-scale promoter recognition in particular and for other DNA sequence analysis problems in general.
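To make three of the sequence descriptors named in the abstract more concrete, the following is a minimal Python sketch of trinucleotide composition, k-spaced nucleotide pair composition, and electron-ion interaction pseudopotential (EIIP) encoding. It is not the authors' implementation: the function names, the normalization by window count, and the handling of non-ACGT characters are assumptions made for illustration, and the EIIP values are the ones commonly cited in the literature for the four nucleotides.

```python
from itertools import product

BASES = "ACGT"

# Commonly cited EIIP values per nucleotide (assumed here, not taken from the paper).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def trinucleotide_composition(seq):
    """Return 64 normalized frequencies of overlapping trinucleotides."""
    seq = seq.upper()
    keys = ["".join(p) for p in product(BASES, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing non-ACGT symbols
            counts[tri] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]


def k_spaced_pair_composition(seq, k):
    """Return 16 normalized frequencies of nucleotide pairs separated by k positions."""
    seq = seq.upper()
    keys = [a + b for a in BASES for b in BASES]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # skip pairs containing non-ACGT symbols
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in keys]


def eiip_encode(seq):
    """Map each nucleotide to its EIIP value; unknown symbols map to 0.0."""
    return [EIIP.get(base, 0.0) for base in seq.upper()]


if __name__ == "__main__":
    example = "ACGTACGT"  # toy sequence, not from the study's dataset
    print(len(trinucleotide_composition(example)))
    print(len(k_spaced_pair_composition(example, k=1)))
    print(eiip_encode(example))
```

In a multi-view setup such as the one described, vectors like these would typically be computed per sequence and passed to the network alongside the learned word2vec embeddings; how PROCABLES combines them is not specified in the abstract.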