School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China.
College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China.
Interdiscip Sci. 2024 Dec;16(4):814-828. doi: 10.1007/s12539-024-00637-8. Epub 2024 Aug 7.
Promoters are important cis-regulatory elements that regulate gene expression, and predicting them accurately is crucial for elucidating the biological functions and potential mechanisms of genes. Many previous prokaryotic promoter prediction methods achieve encouraging prediction performance, but most of them recognize promoters in only one or a few bacterial species. Moreover, because they ignore promoter sequence motifs, the interpretability of their predictions is limited. In this work, we present a generalized method, Prompt (Promoters in multiple prokaryotes), to predict promoters in 16 prokaryotes and improve the interpretability of prediction results. Prompt integrates three methods, RSK (Regression based on Selected k-mer), CL (Contrastive Learning) and MLP (Multilayer Perceptron), and employs a voting strategy to divide the datasets into high-confidence and low-confidence categories. Results on the promoter prediction tasks show that, on the high-confidence datasets, the performance of Prompt (accuracy and Matthews correlation coefficient) is greater than 80% in all 16 prokaryotes and greater than 90% in 12 of them, and that Prompt performs best compared with other existing methods. Moreover, by identifying promoter sequence motifs, Prompt can improve the interpretability of the predictions. Prompt is freely available at https://github.com/duqimeng/PromptPrompt and will contribute to research on promoters in prokaryotes.
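The voting step that separates high-confidence from low-confidence predictions can be illustrated with a minimal sketch. The abstract does not state the exact voting rule, so the sketch assumes that a sequence is high-confidence when all three models (RSK, CL, MLP) agree and low-confidence otherwise, with the majority label kept in both cases. The function name `vote` and the example predictions are hypothetical and are not taken from the Prompt code base.

```python
from collections import Counter
from typing import List, Tuple


def vote(predictions: List[int]) -> Tuple[int, str]:
    """Return (majority label, confidence tier) for one sequence.

    Assumption: unanimous agreement means "high" confidence and
    any disagreement means "low"; the paper's rule may differ.
    """
    counts = Counter(predictions)
    label, n_votes = counts.most_common(1)[0]
    tier = "high" if n_votes == len(predictions) else "low"
    return label, tier


# Hypothetical binary outputs (1 = promoter, 0 = non-promoter)
# from the three models for four sequences.
rsk = [1, 0, 1, 0]
cl = [1, 0, 0, 0]
mlp = [1, 0, 1, 1]

results = [vote([a, b, c]) for a, b, c in zip(rsk, cl, mlp)]

# Indices of sequences in each confidence category.
high = [i for i, (_, tier) in enumerate(results) if tier == "high"]
low = [i for i, (_, tier) in enumerate(results) if tier == "low"]
```

With three binary voters a tie cannot occur, so the majority label is always well defined; under the assumed rule, reported metrics for the high-confidence category would be computed only over the indices in `high`.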