Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America.
PLoS Genet. 2012;8(3):e1002599. doi: 10.1371/journal.pgen.1002599. Epub 2012 Mar 22.
The average individual is expected to harbor thousands of variants within non-coding genomic regions involved in gene regulation. However, it is currently not possible to interpret reliably the functional consequences of genetic variation within any given transcription factor recognition sequence. To address this, we comprehensively analyzed heritable genome-wide binding patterns of a major sequence-specific regulator (CTCF) in relation to genetic variability in binding site sequences across a multi-generational pedigree. We localized and quantified CTCF occupancy by ChIP-seq in 12 related and unrelated individuals spanning three generations, followed by comprehensive targeted resequencing of the entire CTCF-binding landscape across all individuals. We identified hundreds of variants with reproducible quantitative effects on CTCF occupancy (both positive and negative). While these effects paralleled protein-DNA recognition energetics when averaged, they were extensively buffered by striking local context dependencies. In the significant majority of cases buffering was complete, resulting in silent variants spanning every position within the DNA recognition interface irrespective of level of binding energy or evolutionary constraint. The prevalence of complex partial or complete buffering effects severely constrained the ability to predict reliably the impact of variation within any given binding site instance. Surprisingly, 40% of variants that increased CTCF occupancy occurred at positions of human-chimp divergence, challenging the expectation that the vast majority of functional regulatory variants should be deleterious. Our results suggest that, even in the presence of "perfect" genetic information afforded by resequencing and parallel studies in multiple related individuals, genomic site-specific prediction of the consequences of individual variation in regulatory DNA will require systematic coupling with empirical functional genomic measurements.