Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA.
Genome Res. 2017 Nov;27(11):1930-1938. doi: 10.1101/gr.220673.117. Epub 2017 Oct 12.
The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of functional genomics' public catalogs is based on ChIP-seq data. These catalogs rely on peak calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance, as a result of the experimental protocol's lack of perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning is that the GC effect varies across experiments, with the effect strong enough to result in a substantial number of peaks called differently when different laboratories perform experiments on the same cell line. However, accounting for GC-content bias in ChIP-seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signals with unwanted variability. To address this challenge, we introduce a statistical approach that accounts for GC effects on both nonspecific noise and signal induced by the binding site. The method can be used to account for this bias in binding quantification as well as to improve existing peak calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
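As a rough illustration of what a GC effect on coverage looks like, the sketch below groups genomic windows by GC fraction and averages the read counts within each group. It is a toy diagnostic under stated assumptions, not the statistical approach described in the abstract: the function names `gc_fraction` and `mean_coverage_by_gc`, the equal-width strata, and the inputs (window sequences paired with read counts) are all hypothetical choices made for this example.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C among the unambiguous A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Windows with no unambiguous bases (e.g. all N) are assigned 0.0.
    return gc / total if total else 0.0


def mean_coverage_by_gc(windows, counts, n_strata=5):
    """Average read count per equal-width GC stratum.

    windows  : iterable of window sequences (hypothetical input)
    counts   : read counts aligned with windows (hypothetical input)
    n_strata : number of equal-width GC strata over [0, 1]
    """
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for seq, count in zip(windows, counts):
        # Clamp so a GC fraction of exactly 1.0 lands in the top stratum.
        stratum = min(int(gc_fraction(seq) * n_strata), n_strata - 1)
        sums[stratum] += count
        sizes[stratum] += 1
    return {s: sums[s] / sizes[s] for s in sorted(sizes)}
```

If the mean count rises or falls systematically across strata in a control or background sample, that pattern is consistent with a GC effect on coverage. Because binding sites themselves tend to lie in GC-rich regions, as the abstract notes, a trend like this in a ChIP sample cannot by itself separate bias from real signal.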