Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework.

Affiliations

Northwest A&F University, China.

Department of Biochemistry and Molecular Biology and the Infection and Immunity Program, Biomedicine Discovery Institute, Monash University, Australia.

Publication information

Brief Bioinform. 2021 Mar 22;22(2):2126-2140. doi: 10.1093/bib/bbaa049.

Abstract

Promoters are short consensus DNA sequences that are responsible for the transcriptional activation or repression of all genes. Bacteria have many types of promoters, which play important roles in initiating gene transcription. Solving promoter-identification problems therefore has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for both identifying promoters and classifying them by type. SELECTOR combines four groups of features (the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features) and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests on benchmark datasets and independent tests on a newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in predicting both general and specific types of promoters in Escherichia coli. Furthermore, this framework uses the Shapley Additive exPlanation algorithm to provide interpretations that aid understanding of the model's success. These interpretations highlight the most important features for predicting both general and specific types of promoters, and they overcome the limitations of existing 'black-box' approaches, which are unable to reveal causal relationships from large numbers of initially encoded features.
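The abstract names the composition of k-spaced nucleic acid pairs as one of the feature groups but does not spell out the encoding. The sketch below follows the commonly used definition of that descriptor: for each gap size, count every ordered pair of nucleotides separated by that many positions and normalise by the number of pairs counted. The function name `cksnap`, the `max_gap` default, and the handling of ambiguous bases are illustrative assumptions, not settings taken from SELECTOR.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The 16 ordered nucleotide pairs, in a fixed order: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def cksnap(sequence, max_gap=2):
    """Return k-spaced nucleotide pair frequencies for gaps 0..max_gap.

    Hypothetical helper for illustration; max_gap=2 is an assumed default.
    The result has 16 * (max_gap + 1) values.
    """
    seq = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Pair position i with the position gap + 1 steps downstream.
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip pairs containing ambiguous bases such as N
                counts[pair] += 1
                total += 1
        # Normalise counts to frequencies; all zeros if no valid pair exists.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features
```

Calling `cksnap("ACGT", max_gap=1)` yields a 32-value vector: 16 frequencies for adjacent pairs followed by 16 for pairs separated by one nucleotide. In a stacked setup of the kind the abstract describes, vectors like this would be concatenated with the other feature groups before being passed to the base learners.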
