BioFrontiers Institute, University of Colorado Boulder, Boulder, Colorado, United States of America.
Department of Molecular and Cellular Biology, University of Colorado Boulder, Boulder, Colorado, United States of America.
PLoS One. 2021 Jun 24;16(6):e0237055. doi: 10.1371/journal.pone.0237055. eCollection 2021.
A key aspect in defining cell state is the complex choreography of DNA binding events in a given cell type, which in turn establishes a cell-specific gene-expression program. Here we performed a deep analysis of DNA binding events and transcriptional output in a single cell state (K562 cells). To this end, we re-analyzed data for 195 DNA binding proteins contained in ENCODE. We used standardized analysis pipelines, containerization, and literate programming with R Markdown for reproducibility and rigor. Our approach validated many findings from previous independent studies, underscoring the importance of ENCODE's goal of providing these reproducible data resources. We also had several new findings: (i) 1,362 promoters, which we refer to as 'reservoirs', that have up to 111 different DNA binding proteins localized on one promoter yet show no expression of steady-state RNA; (ii) reservoirs do not overlap super-enhancer annotations and have properties distinct from those of super-enhancers; (iii) the human-specific SVA repeat element may have been co-opted for enhancer regulation and is highly transcribed in PRO-seq and RNA-seq. Collectively, this study, performed by the students of a CU Boulder computational biology class (BCHM 5631, Spring 2020), demonstrates the value of reproducible findings and how resources like ENCODE that prioritize data standards can foster new findings from existing data in a didactic environment.
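The 'reservoir' definition in finding (i) combines two per-promoter measurements: how many distinct DNA binding proteins are bound, and whether any steady-state RNA is detected. The following is a minimal sketch of that kind of filter, not the study's actual pipeline. The `Promoter` record, the `find_reservoirs` function, and both cutoffs (`min_dbps`, `max_tpm`) are hypothetical names and values chosen for illustration; the abstract does not state the thresholds used.

```python
from dataclasses import dataclass


@dataclass
class Promoter:
    gene_id: str
    n_bound_dbps: int  # distinct DNA binding proteins with a peak on this promoter
    tpm: float         # steady-state RNA expression (e.g. from RNA-seq)


def find_reservoirs(promoters, min_dbps=7, max_tpm=0.001):
    """Return promoters that are heavily bound yet show no detectable expression.

    Both cutoffs are illustrative assumptions, not values taken from the paper.
    """
    return [
        p for p in promoters
        if p.n_bound_dbps >= min_dbps and p.tpm <= max_tpm
    ]


# Toy data: only "a" is both heavily bound and silent.
toy = [
    Promoter("a", n_bound_dbps=50, tpm=0.0),   # bound, not expressed
    Promoter("b", n_bound_dbps=50, tpm=12.0),  # bound, expressed
    Promoter("c", n_bound_dbps=2, tpm=0.0),    # barely bound, not expressed
]
print([p.gene_id for p in find_reservoirs(toy)])  # ['a']
```

In practice the binding counts would come from overlapping peak calls with promoter windows, and the choice of cutoffs would need to be justified against the data.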