Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico.
Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215, USA.
Microb Genom. 2022 May;8(5). doi: 10.1099/mgen.0.000833.
Genomics has set the basis for a variety of methodologies that produce high-throughput datasets identifying the different players that define gene regulation, particularly regulation of transcription initiation and operon organization. These datasets are available in public repositories such as the Gene Expression Omnibus and ArrayExpress. However, accessing and navigating such a wealth of data is not straightforward. No resource currently exists that offers all available high- and low-throughput data on transcriptional regulation in Escherichia coli K-12 in a form that is easy to use both as whole datasets and as individual interactions and regulatory elements. RegulonDB (https://regulondb.ccg.unam.mx) began gathering high-throughput dataset collections in 2009, starting with transcription start sites, then adding ChIP-seq and gSELEX in 2012, with up to 99 different experimental high-throughput datasets available in 2019. In this paper we present a radical upgrade to more than 2000 high-throughput datasets, processed to facilitate their comparison. We introduce up-to-date collections of transcription termination sites and transcription units, transcription factor binding interactions derived from ChIP-seq, ChIP-exo, gSELEX and DAP-seq experiments, and expression profiles derived from RNA-seq experiments. For ChIP-seq experiments we offer both the data as presented by the authors and data uniformly processed in-house, which enhances their comparability as well as the traceability of the methods and the reproducibility of the results. Furthermore, we have expanded the tools available for browsing and visualization across and within datasets. These include comparisons against previously existing knowledge in RegulonDB from classic experiments, a nucleotide-resolution genome viewer, and an interface that enables users to browse datasets by querying their metadata. A particular effort was made to automatically extract detailed experimental growth conditions by implementing an assisted curation strategy that applies natural language processing and machine learning. We provide summaries with the total number of interactions found in each experiment, as well as tools to identify common results among different experiments. This is a long-awaited resource for making use of this wealth of knowledge and advancing our understanding of the biology of the model bacterium Escherichia coli K-12.