CDBProm: the Comprehensive Directory of Bacterial Promoters.

Author information

Martinez Gustavo Sganzerla, Perez-Rueda Ernesto, Kumar Anuj, Dutt Mansi, Maya Cinthia Rodríguez, Ledesma-Dominguez Leonardo, Casa Pedro Lenz, Kumar Aditya, de Avila E Silva Scheila, Kelvin David J

Affiliations

Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada.

Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada.

Publication information

NAR Genom Bioinform. 2024 Feb 21;6(1):lqae018. doi: 10.1093/nargab/lqae018. eCollection 2024 Mar.

Abstract

The decreasing cost of whole genome sequencing has produced high volumes of genomic information that require annotation. The experimental identification of promoter sequences, pivotal for regulating gene expression, is a laborious and cost-prohibitive task. To expedite this, we introduce the Comprehensive Directory of Bacterial Promoters (CDBProm), a directory of predicted bacterial promoter sequences. We first identified that an Extreme Gradient Boosting (XGBoost) algorithm would distinguish promoters from random downstream regions with an accuracy of 87%. To capture distinctive promoter signals, we generated a second XGBoost classifier trained on the instances misclassified in our first classifier. The predictor of CDBProm is then fed with over 55 million upstream regions from more than 6000 bacterial genomes. Upon finding potential promoter sequences in upstream regions, each promoter is mapped to the genomic data of the organism, linking the predicted promoter with its coding DNA sequence, and identifying the function of the gene regulated by the promoter. The collection of bacterial promoters available in CDBProm enables the quantitative analysis of a plethora of bacterial promoters. Our collection with over 24 million promoters is publicly available at https://aw.iimas.unam.mx/cdbprom/.
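The upstream-region extraction and promoter-to-gene linking steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the CDBProm pipeline: the `Gene` record, the function names, and the 80 bp window are all assumptions made for this example, and `is_promoter` is a placeholder for whatever trained classifier is used (the abstract names XGBoost but gives no interface).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Gene:
    """Hypothetical annotation record; field names are assumptions."""
    locus_tag: str
    start: int    # 0-based inclusive start on the forward strand
    end: int      # 0-based exclusive end on the forward strand
    strand: str   # "+" or "-"
    product: str  # annotated function of the gene


def upstream_region(genome: str, gene: Gene, length: int) -> str:
    """Return up to `length` bases upstream of the gene, read 5'->3'
    on the gene's own strand. The window is clipped at sequence ends."""
    if gene.strand == "+":
        lo = max(0, gene.start - length)
        return genome[lo:gene.start]
    # For a minus-strand gene, "upstream" lies after `end` on the
    # forward strand and must be reverse-complemented.
    hi = min(len(genome), gene.end + length)
    return reverse_complement(genome[gene.end:hi])


def link_promoters(
    genome: str,
    genes: List[Gene],
    is_promoter: Callable[[str], bool],
    length: int = 80,  # assumed window size, not taken from the source
) -> List[Dict[str, str]]:
    """Keep upstream regions the classifier accepts and attach the
    downstream gene's identifier and annotated function to each."""
    hits = []
    for gene in genes:
        region = upstream_region(genome, gene, length)
        if region and is_promoter(region):
            hits.append({
                "locus_tag": gene.locus_tag,
                "product": gene.product,
                "promoter": region,
            })
    return hits
```

In practice `is_promoter` would wrap a trained model's prediction on an encoded sequence. For a quick check, any string predicate works, for example `lambda s: "TATAAT" in s`, which only tests for the canonical -10 hexamer and is far cruder than a trained classifier.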


