Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India.
Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada.
J Chem Inf Model. 2024 Apr 8;64(7):2705-2719. doi: 10.1021/acs.jcim.3c02017. Epub 2024 Jan 23.
Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge because of their diverse architecture and sequence variation. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, together with state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and the Matthews correlation coefficient (MCC). Our findings reveal that XGBoost outperformed the other models and the current state-of-the-art promoter prediction tools Sigma70pred and iPromoter2L, achieving F1 scores above 95% in most systems. Notably, one-hot encoding of the nucleotide sequences complements the structural features and enhances the predictive capability of our XGBoost model. To address the challenge of model interpretability, we incorporated explainable AI techniques based on Shapley values, which allow the model's predictions to be better understood and interpreted. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, using original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
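As an illustration only, two of the steps named in the abstract (one-hot encoding of nucleotide sequences and the six evaluation metrics) can be sketched in plain Python. This is not the MLDSPP implementation: the function names `one_hot_encode` and `binary_metrics`, the A/C/G/T column order, the all-zero encoding of ambiguous bases, and the convention of returning 0.0 when a denominator is zero are all assumptions made for this sketch.

```python
import math

NUCLEOTIDES = "ACGT"  # assumed column order; the source does not specify one


def one_hot_encode(sequence):
    """Flatten a DNA sequence into a binary vector with four entries per base."""
    vector = []
    for base in sequence.upper():
        # Bases outside A/C/G/T (e.g. N) become all zeros: an assumption of this sketch.
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (sketch convention)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, specificity, F1, and MCC.

    Labels are 1 for promoter and 0 for nonpromoter.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": _safe_div(tn, tn + fp),
        "f1": _safe_div(2 * precision * recall, precision + recall),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
    print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

In a fuller pipeline, vectors like these would presumably be concatenated with the structural features (duplex stability, base stacking) before being passed to a classifier such as XGBoost; that step is omitted here because the abstract does not describe the feature layout.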