Lee Christopher T, Cavalcante Raymond G, Lee Chee, Qin Tingting, Patil Snehal, Wang Shuze, Tsai Zing T Y, Boyle Alan P, Sartor Maureen A
Biostatistics Department, University of Michigan, Ann Arbor, MI 48109, USA.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
NAR Genom Bioinform. 2020 Mar;2(1):lqaa006. doi: 10.1093/nargab/lqaa006. Epub 2020 Feb 6.
Gene set enrichment (GSE) testing enhances the biological interpretation of ChIP-seq data and other large sets of genomic regions. Our group previously introduced two GSE methods for genomic regions: ChIP-Enrich for narrow regions and Broad-Enrich for broad regions. Here, we introduce Poly-Enrich, which has wider applicability and additional capabilities. To determine gene set enrichment, Poly-Enrich models the number of peaks assigned to a gene using a generalized additive model with a negative binomial family, while adjusting for gene locus length. Unlike ChIP-Enrich, Poly-Enrich works well even when nearly all genes have a peak; we illustrate this by using Poly-Enrich to characterize the pathways and types of genic regions enriched with different families of repetitive elements. By comparing Poly-Enrich and ChIP-Enrich results on ENCODE ChIP-seq data, we found that the optimal test depends more on the pathway being regulated than on properties of the transcription factors. Using known transcription factor functions, we discovered clusters of related biological processes that are consistently better modeled with Poly-Enrich. This suggests that the regulation of certain processes may be modified by multiple binding events, which a count-based method models better. Our new hybrid method automatically uses the optimal method for each gene set, with correct FDR adjustment.
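The count model described above can be written out as a short sketch. The abstract specifies only a negative binomial generalized additive model of per-gene peak counts with an adjustment for locus length. The symbols below, the log link, the log10 transform of locus length, and the variance parameterization are assumptions made for illustration, not details confirmed by this record.

```latex
% Illustrative sketch only; all symbols are assumed notation.
% Y_g  : number of peaks assigned to gene g
% GS_g : 1 if gene g belongs to the gene set being tested, 0 otherwise
% L_g  : locus length of gene g
% f    : smooth function (e.g. a spline) absorbing the locus-length effect
\begin{align}
  Y_g &\sim \mathrm{NegBin}(\mu_g, \theta), \\
  \log \mu_g &= \beta_0 + \beta_1\,\mathrm{GS}_g + f\!\left(\log_{10} L_g\right), \\
  \operatorname{Var}(Y_g) &= \mu_g + \frac{\mu_g^{2}}{\theta}.
\end{align}
% Enrichment of the gene set corresponds to testing H_0: \beta_1 = 0
% against \beta_1 > 0 (more peaks per gene than locus length alone predicts).
```

Under this formulation, a gene set is called enriched when its member genes carry more peaks than their locus lengths alone would predict.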
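The hybrid step, choosing between two tests per gene set while keeping error rates controlled, can be sketched as follows. The abstract does not state the combination rule. This sketch assumes a Bonferroni-style rule (twice the smaller of the two p-values, capped at 1) followed by Benjamini-Hochberg adjustment. The function names and the example gene set names are hypothetical.

```python
# Minimal sketch of a per-gene-set hybrid test with FDR adjustment.
# ASSUMPTION: the hybrid p-value is 2 * min(p_chip, p_poly), capped at 1;
# the abstract itself does not give the combination rule.

def hybrid_p_value(p_chip, p_poly):
    """Combine two p-values for one gene set, correcting for picking the smaller."""
    return min(1.0, 2.0 * min(p_chip, p_poly))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def hybrid_fdr(chip_p, poly_p):
    """Map each gene set name to (hybrid p-value, BH-adjusted q-value)."""
    names = list(chip_p)
    hybrid = [hybrid_p_value(chip_p[n], poly_p[n]) for n in names]
    q_values = benjamini_hochberg(hybrid)
    return {n: (p, q) for n, p, q in zip(names, hybrid, q_values)}


if __name__ == "__main__":
    # Hypothetical p-values for three gene sets from the two tests.
    chip = {"set_A": 0.004, "set_B": 0.30, "set_C": 0.02}
    poly = {"set_A": 0.010, "set_B": 0.01, "set_C": 0.50}
    for name, (p, q) in hybrid_fdr(chip, poly).items():
        print(name, round(p, 4), round(q, 4))
```

Doubling the smaller p-value is a conservative way to pay for having looked at two tests; the FDR adjustment is then applied once, across gene sets, to the combined p-values.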