Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA.
Department of Biostatistics, School of Public Health, University of Michigan Medical School, Ann Arbor, MI, USA.
Genome Biol. 2022 Apr 26;23(1):105. doi: 10.1186/s13059-022-02668-0.
Revealing the gene targets of distal regulatory elements is challenging yet critical for interpreting regulome data. Experiment-derived enhancer-gene links are restricted to a small set of enhancers and/or cell types, while the accuracy of genome-wide approaches remains unclear because no systematic evaluation has been performed. We combined multiple spatial and in silico approaches for defining enhancer locations and linking them to their target genes, aggregated across >500 cell types, generating 1860 human genome-wide distal enhancer-to-target gene definitions (EnTDefs). To evaluate performance, we used gene set enrichment (GSE) testing on 87 independent ENCODE ChIP-seq datasets of 34 transcription factors (TFs) and assessed the concordance of the results with known TF Gene Ontology annotations and with other benchmarks.
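As a rough illustration of what a GSE test on region-derived target genes involves, the sketch below computes a one-sided hypergeometric over-representation p-value. This is a simplified stand-in and not necessarily the test used in the study; the function names and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_targets, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    X is the number of gene-set members among n_targets genes drawn
    without replacement from a universe of n_universe genes, of which
    n_in_set belong to the gene set.
    """
    max_overlap = min(n_in_set, n_targets)
    total = comb(n_universe, n_targets)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_targets - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrichment_p(target_genes, gene_set, universe):
    """Test whether target_genes are over-represented in gene_set.

    target_genes would come from mapping peaks to genes with some
    enhancer-to-target definition (assumed to be done upstream).
    """
    universe = set(universe)
    targets = set(target_genes) & universe
    members = set(gene_set) & universe
    return hypergeom_enrichment_p(
        len(universe), len(members), len(targets), len(targets & members)
    )


# Toy data: 20 genes in the universe, 5 in the gene set, 4 targets, 2 shared.
universe = {f"g{i}" for i in range(20)}
gene_set = {"g0", "g1", "g2", "g3", "g4"}
targets = {"g0", "g1", "g10", "g11"}
p_value = enrichment_p(targets, gene_set, universe)
```

Under this simplification, a better enhancer-to-target definition would be expected to yield smaller p-values for the gene sets already annotated to the TF whose ChIP-seq peaks are being tested.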
The 741 top-ranked EnTDefs (40%) significantly outperform the common, naïve approach of linking distal regions to the nearest genes, and the top 10 EnTDefs perform well when applied to ChIP-seq data of other cell types. The GSE-based ranking of EnTDefs is highly concordant with ranking based on overlap with curated benchmarks of enhancer-gene interactions. Both our top general EnTDef and cell-type-specific EnTDefs significantly outperform seven independent computational and experiment-based enhancer-gene pair datasets. We show that using our top EnTDefs for GSE with either genome-wide DNA methylation or ATAC-seq data recapitulates the biological processes altered in gene expression data generated in parallel in the same experiment better than using our lower-ranked EnTDefs.
Our findings illustrate the power of our approach to provide genome-wide interpretation regardless of cell type.
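The naïve nearest-gene baseline mentioned above can be contrasted with an enhancer-to-target lookup in a few lines. The following is a minimal sketch under stated assumptions: the data layout (per-chromosome sorted TSS lists, enhancer intervals carrying a target gene), the function names, and the toy coordinates are all hypothetical and are not taken from the EnTDef resource itself.

```python
from bisect import bisect_left


def nearest_gene(tss_by_chrom, chrom, start, end):
    """Return the gene whose TSS is closest to the region midpoint.

    tss_by_chrom maps chromosome -> sorted list of (tss_position, gene).
    Returns None if the chromosome has no annotated TSS.
    """
    tss_list = tss_by_chrom.get(chrom)
    if not tss_list:
        return None
    mid = (start + end) // 2
    i = bisect_left(tss_list, (mid,))
    # Only the TSS just before and just at/after the midpoint can be nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda t: abs(t[0] - mid))[1]


def linked_genes(enhancers_by_chrom, chrom, start, end):
    """Return target genes of all enhancers overlapping the region.

    enhancers_by_chrom maps chromosome -> list of
    (enh_start, enh_end, target_gene), using half-open intervals.
    """
    return {
        gene
        for e_start, e_end, gene in enhancers_by_chrom.get(chrom, [])
        if e_start < end and start < e_end
    }


# Toy annotation: the distal region sits nearest GENE_A, but the
# (hypothetical) enhancer definition links it to the more distant GENE_C.
tss = {"chr1": [(1000, "GENE_A"), (5000, "GENE_B"), (9000, "GENE_C")]}
enhancers = {"chr1": [(2700, 3100, "GENE_C")]}

by_nearest = nearest_gene(tss, "chr1", 2800, 3000)
by_enhancer = linked_genes(enhancers, "chr1", 2800, 3000)
```

In this toy case the two strategies disagree, which is the situation where the choice of enhancer-to-target definition changes the downstream gene set enrichment results.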