Zeng Xin, Li Bo, Welch Rene, Rojo Constanza, Zheng Ye, Dewey Colin N, Keleş Sündüz
Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America.
California Institute for Quantitative Biosciences, University of California, Berkeley, California, United States of America.
PLoS Comput Biol. 2015 Oct 20;11(10):e1004491. doi: 10.1371/journal.pcbi.1004491. eCollection 2015 Oct.
Segmental duplications and other highly repetitive regions of genomes contribute significantly to cells' regulatory programs. Advances in next-generation sequencing have enabled genome-wide profiling of protein-DNA interactions by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq). However, interactions in highly repetitive regions of genomes have proven difficult to map because short reads of 50-100 base pairs (bp) from these regions map to multiple locations in reference genomes. Standard analytical methods discard such multi-mapping reads, and the few that can accommodate them are prone to high false-positive and false-negative rates. We developed Perm-seq, a prior-enhanced read allocation method for ChIP-seq experiments that can allocate multi-mapping reads in highly repetitive regions of genomes with high accuracy. We comprehensively evaluated Perm-seq and found that our prior-enhanced approach significantly improves multi-read allocation accuracy over approaches that do not utilize additional data types. The statistical formalism underlying our approach facilitates supervising multi-read allocation with a variety of data sources, including histone ChIP-seq. We applied Perm-seq to 64 ENCODE ChIP-seq datasets from GM12878 and K562 cells and identified many novel protein-DNA interactions in segmental duplication regions. Our analysis reveals that although protein-DNA interaction sites are evolutionarily less conserved in repetitive regions, they share the overall sequence characteristics of protein-DNA interactions in non-repetitive regions.