Wu Dai-Ying, Bittencourt Danielle, Stallcup Michael R, Siegmund Kimberly D
Department of Biochemistry and Molecular Biology, University of Southern California Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA.
Department of Preventive Medicine, University of Southern California Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA.
Front Genet. 2015 Apr 29;6:169. doi: 10.3389/fgene.2015.00169. eCollection 2015.
ChIP-seq is a widely used assay to measure genome-wide protein binding. The decrease in costs associated with sequencing has led to a rise in the number of studies that investigate protein binding across treatment conditions or cell lines. In addition to identifying binding sites, new studies evaluate the variation in protein binding between conditions. A number of approaches to study differential transcription factor binding have recently been developed. Several of these methods build upon established methods from RNA-seq to quantify differences in read counts. We compare how these new approaches perform on different data sets from the ENCODE project to illustrate the impact of data processing pipelines under different study designs. The performance of normalization methods for differential ChIP-seq depends strongly on the variation in the total amount of protein bound between conditions, with total read count outperforming effective library size, or variants thereof, when a large variation in binding was studied. Use of input subtraction to correct for non-specific binding had a relatively modest impact on the number of differential peaks found and on the accuracy of fold changes relative to biological validation; however, a larger impact might be expected for samples with more extreme copy number variations between them. Still, input subtraction did identify a small subset of novel differential regions while excluding some differential peaks in regions with high background signal. These results highlight proper scaling for between-sample data normalization as critical for differential transcription factor binding analysis and suggest that bioinformaticians need to know about the variation in the level of total protein binding between conditions to select the best analysis method. At the same time, validation using fold-change estimates from qRT-PCR suggests there is still room for further method improvement.
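The contrast between the two scaling choices named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy counts, the pseudocount, and the definition of effective library size as "reads falling in peaks" are all assumptions made for illustration. The toy data model a global threefold increase in binding at equal sequencing depth.

```python
import math

def scale_factors(library_sizes):
    """Hypothetical helper: scale every sample to the mean library size."""
    mean_size = sum(library_sizes) / len(library_sizes)
    return [mean_size / size for size in library_sizes]

def subtract_input(chip, input_counts, chip_total, input_total):
    """Subtract input reads, rescaled to the ChIP depth, flooring at zero."""
    ratio = chip_total / input_total
    return [max(c - ratio * i, 0.0) for c, i in zip(chip, input_counts)]

def log2_fold_changes(counts_a, counts_b, factor_a, factor_b, pseudocount=1.0):
    """Per-peak log2(B / A) after scaling; the pseudocount avoids log(0)."""
    return [
        math.log2((b * factor_b + pseudocount) / (a * factor_a + pseudocount))
        for a, b in zip(counts_a, counts_b)
    ]

# Toy data (assumed): condition B binds three times more at every peak,
# while both libraries were sequenced to the same total depth.
peaks_a = [100, 200, 50]
peaks_b = [300, 600, 150]
total_reads = [1_000_000, 1_000_000]

# Option 1: normalize by total read count.
fa, fb = scale_factors(total_reads)
lfc_total = log2_fold_changes(peaks_a, peaks_b, fa, fb)

# Option 2: normalize by effective library size (here: reads in peaks).
fa, fb = scale_factors([sum(peaks_a), sum(peaks_b)])
lfc_effective = log2_fold_changes(peaks_a, peaks_b, fa, fb)

print("total read count:      ", [round(x, 2) for x in lfc_total])
print("effective library size:", [round(x, 2) for x in lfc_effective])
```

In this toy setting, scaling by reads in peaks absorbs the global increase into the scale factor, so the per-peak fold changes collapse toward zero, whereas scaling by total read count preserves them. That is one way to read the abstract's observation that total read count performed better when binding varied widely between conditions. The `subtract_input` helper shows one simple form of input subtraction and is likewise only an assumed variant.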