Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida, USA; UF Genetics Institute, University of Florida, Gainesville, Florida, USA.
Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA.
Comput Struct Biotechnol J. 2014 Jan 31;9:e201401002. doi: 10.5936/csbj.201401002. eCollection 2014.
ChIP-seq experiments identify genome-wide profiles of DNA-binding molecules, including transcription factors, enzymes and epigenetic marks. Biological replicates are critical for reliable site discovery and are required for the deposition of data in the ENCODE and modENCODE projects. While early reports suggested that two replicates were sufficient, the widespread application of the technique has led to an emerging consensus that it is noisy and that increasing replication may be worthwhile. Additional biological replicates also allow for quantitative assessment of differences between conditions. To date, it remains controversial how to confirm peak identification and how to determine signal strength across biological replicates, particularly when the number of replicates is greater than two. Using objective metrics, we evaluate the consistency of biological replicates in ChIP-seq experiments with more than two replicates. We compare several approaches for binding site determination, including two popular but disparate peak callers, CisGenome and MACS2. We propose read coverage as a quantitative measurement of signal strength for estimating sample concordance. We also examine determining binding based on genomic features, such as promoters. We find that increasing the number of biological replicates increases the reliability of peak identification. Critically, binding sites with strong biological evidence may be missed if researchers rely on only two biological replicates. When more than two replicates are performed, a simple majority rule (>50% of samples identify a peak) identifies peaks more reliably in all biological replicates than the absolute concordance of peak identification between any two replicates, further demonstrating the utility of increasing replicate numbers in ChIP-seq experiments.
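The majority rule described above (>50% of samples identify a peak) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes peaks are half-open `(start, end)` intervals on a single chromosome, that candidate regions are formed by merging the union of all replicate peaks, and that a replicate "identifies" a region if any of its peaks overlaps it. The function names `merge_intervals` and `majority_rule_peaks` are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# Assumption: a peak is a half-open interval [start, end) on one chromosome.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into candidate regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the last region when the new interval overlaps or touches it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def majority_rule_peaks(replicates: List[List[Interval]]) -> List[Interval]:
    """Keep candidate regions supported by more than 50% of replicates."""
    candidates = merge_intervals([peak for rep in replicates for peak in rep])
    n_replicates = len(replicates)
    kept: List[Interval] = []
    for region in candidates:
        # Count replicates with at least one peak overlapping this region.
        support = sum(
            any(overlaps(region, peak) for peak in rep) for rep in replicates
        )
        if support > n_replicates / 2:
            kept.append(region)
    return kept


# Hypothetical peak calls from three biological replicates.
rep1 = [(100, 200), (500, 600)]
rep2 = [(150, 250), (900, 1000)]
rep3 = [(120, 180), (520, 580)]

print(majority_rule_peaks([rep1, rep2, rep3]))
```

In this toy input, the region around 500-600 is called in only two of three replicates. It would be lost under strict concordance between `rep1` and `rep2` alone, but it passes the majority rule. Merging the union of peaks is a simplification; real peak sets span multiple chromosomes, and a chain of overlapping peaks can merge into one broad region.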