Stricker Georg, Engelhardt Alexander, Schulz Daniel, Schmid Matthias, Tresch Achim, Gagneur Julien
Gene Center and Department of Biochemistry, Ludwig-Maximilians-Universität München, 80333 Munich, Germany.
Department of Informatics, Technische Universität München, 85748 Garching, Germany.
Bioinformatics. 2017 Aug 1;33(15):2258-2265. doi: 10.1093/bioinformatics/btx150.
Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is a widely used approach to study protein-DNA interactions. Often, the quantities of interest are the differential occupancies relative to controls, between genetic backgrounds, treatments, or combinations thereof. However, current methods for differential occupancy analysis of ChIP-Seq data rely on binning or sliding-window techniques, for which the choice of window and bin sizes is subjective.
Here, we present GenoGAM (Genome-wide Generalized Additive Model), which brings the well-established and flexible generalized additive models framework to genomic applications using a data parallelism strategy. We model ChIP-Seq read count frequencies as products of smooth functions along chromosomes. Smoothing parameters are objectively estimated from the data by cross-validation, eliminating ad hoc binning and windowing needed by current approaches. GenoGAM provides base-level and region-level significance testing for full factorial designs. Application to a ChIP-Seq dataset in yeast showed increased sensitivity over existing differential occupancy methods while controlling for type I error rate. By analyzing a set of DNA methylation data and illustrating an extension to a peak caller, we further demonstrate the potential of GenoGAM as a generic statistical modeling tool for genome-wide assays.
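The idea of modeling counts as products of smooth functions, with the smoothing parameter chosen by cross-validation, can be illustrated with a small NumPy sketch. This is not the GenoGAM implementation or its API. It assumes Poisson-distributed counts, uses Gaussian bumps as a stand-in for a spline basis, penalizes second differences of the coefficients, and uses random K-fold cross-validation over positions. All function names, the basis size, the lambda grid, and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np

N_BASIS = 25  # illustrative basis size, not taken from the paper


def smooth_basis(x, n_basis=N_BASIS):
    """Gaussian bumps on an even grid; a simple stand-in for a spline basis."""
    centers = np.linspace(x.min(), x.max(), n_basis)
    width = centers[1] - centers[0]
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)


def design(x, is_ip, n_basis=N_BASIS):
    """Design for log(mu) = f_bg(x) + is_ip * f_fc(x), plus roughness penalty."""
    B = smooth_basis(x, n_basis)
    X = np.hstack([B, B * is_ip[:, None]])
    D = np.diff(np.eye(n_basis), n=2, axis=0)  # second-difference operator
    S = D.T @ D
    Z = np.zeros_like(S)
    return X, np.block([[S, Z], [Z, S]])


def fit_poisson(X, y, P, lam, n_iter=50, tol=1e-8):
    """Penalized Poisson regression (log link) by iteratively reweighted LS."""
    eta = np.log(y + 0.5)  # common GLM starting value
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(eta)
        z = eta + (y - mu) / mu  # working response
        A = X.T @ (X * mu[:, None]) + lam * P + 1e-8 * np.eye(X.shape[1])
        new_beta = np.linalg.solve(A, X.T @ (mu * z))
        eta = X @ new_beta
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


def cv_lambda(X, y, P, lambdas, n_folds=5, seed=0):
    """Pick the lambda with the lowest held-out Poisson negative log-likelihood."""
    fold = np.random.default_rng(seed).integers(0, n_folds, size=len(y))
    scores = []
    for lam in lambdas:
        nll = 0.0
        for k in range(n_folds):
            train, test = fold != k, fold == k
            beta = fit_poisson(X[train], y[train], P, lam)
            eta = X[test] @ beta
            nll += np.sum(np.exp(eta) - y[test] * eta)  # up to a constant
        scores.append(nll)
    return lambdas[int(np.argmin(scores))]


# Simulated example: a control track and an IP track with one enriched region.
rng = np.random.default_rng(1)
pos = np.arange(400, dtype=float)
true_log_fc = 1.5 * np.exp(-0.5 * ((pos - 200) / 20) ** 2)
x = np.concatenate([pos, pos])
is_ip = np.concatenate([np.zeros(400), np.ones(400)])
y = rng.poisson(5.0 * np.exp(is_ip * np.tile(true_log_fc, 2)))

X, P = design(x, is_ip)
lam = cv_lambda(X, y, P, [0.1, 1.0, 10.0, 100.0])
beta = fit_poisson(X, y, P, lam)
est_log_fc = smooth_basis(pos) @ beta[N_BASIS:]  # smooth log fold change
```

Because the link is logarithmic, the additive predictor corresponds to a product on the count scale: the expected count is exp(f_bg(x)) for the control track and exp(f_bg(x)) * exp(f_fc(x)) for the IP track. No bin or window size is set by hand; the only tuning parameter, lambda, is selected from the data.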
Software is available from Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/GenoGAM.html
Supplementary information is available at Bioinformatics online.