Program in Computational Biology and Bioinformatics, Department of Molecular Biophysics and Biochemistry, W.M. Keck Foundation Biotechnology Resource Laboratory, Yale University, New Haven, CT 06520, USA.
Nucleic Acids Res. 2011 Sep 1;39(16):7058-76. doi: 10.1093/nar/gkr342. Epub 2011 May 19.
In the human genome, it has been estimated that considerably more sequence is under natural selection in non-coding regions [such as transcription-factor binding sites (TF-binding sites) and non-coding RNAs (ncRNAs)] than in protein-coding ones. However, non-coding regions have received less attention. To study selective pressure on non-coding elements, we use next-generation sequencing data from the recently completed pilot phase of the 1000 Genomes Project, which, compared to traditional methods, allows for the characterization of a full spectrum of genomic variations, including single-nucleotide polymorphisms (SNPs), short insertions and deletions (indels) and structural variations (SVs). We develop a framework for combining these variation data with non-coding elements, calculating various population-based metrics to compare classes and subclasses of elements, and developing element-aware aggregation procedures to probe the internal structure of an element. Overall, we find that TF-binding sites and ncRNAs are less selectively constrained for SNPs than coding sequences (CDSs), but more constrained than a neutral reference. We also determine that the relative amounts of constraint for the three types of variations are, in general, correlated, but there are some differences: counter-intuitively, TF-binding sites and ncRNAs are more selectively constrained for indels than for SNPs, compared to CDSs. After inspecting the overall properties of a class of elements, we analyze selective pressure on subclasses within an element class, and show that the extent of selection is associated with the genomic properties of each subclass. We find, for instance, that ncRNAs with higher expression levels tend to be under stronger purifying selection, and the actual regions of TF-binding motifs are under stronger selective pressure than the corresponding peak regions.
Further, we develop element-aware aggregation plots to analyze selective pressure across the linear structure of an element, with the confidence intervals evaluated using both simple bootstrapping and block bootstrapping techniques. We find, for example, that both micro-RNAs (particularly the seed regions) and their binding targets are under stronger selective pressure for SNPs than their immediate genomic surroundings. In addition, we demonstrate that substitutions in TF-binding motifs inversely correlate with site conservation, and SNPs unfavorable for motifs are under stronger selective constraint than favorable SNPs. Finally, to further investigate intra-element differences, we show that SVs tend to use distinctive modes and mechanisms when they interact with genomic elements, such as enveloping whole gene(s) rather than disrupting them partially, as well as duplicating TF motifs in tandem.
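As an illustration of the two confidence-interval techniques named above, the following minimal Python sketch contrasts a simple bootstrap with a moving-block bootstrap for the mean of a per-element SNP indicator at one aggregated position. It is not the paper's implementation: the toy data `snp_flags`, the function names, the block length and the percentile-interval choice are all assumptions made for this example.

```python
import random

def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list, with q in [0, 1]."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]

def simple_bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling single values with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return percentile(means, alpha / 2), percentile(means, 1 - alpha / 2)

def block_bootstrap_ci(values, block_len, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling contiguous blocks of values.

    Drawing whole blocks keeps the correlation between neighbouring entries
    that single-value resampling would destroy (moving-block bootstrap).
    """
    rng = random.Random(seed)
    n = len(values)
    starts = range(n - block_len + 1)   # every valid block start index
    n_blocks = -(-n // block_len)       # ceil(n / block_len)
    means = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(values[s:s + block_len])
        sample = sample[:n]             # trim back to the original length
        means.append(sum(sample) / n)
    means.sort()
    return percentile(means, alpha / 2), percentile(means, 1 - alpha / 2)

# Hypothetical toy data: 1 if an element carries a SNP at the aggregated
# position, 0 otherwise. Elements are ordered by genomic coordinate and the
# SNPs are deliberately clustered, so neighbouring entries are correlated.
snp_flags = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0] * 5
observed_mean = sum(snp_flags) / len(snp_flags)

simple_ci = simple_bootstrap_ci(snp_flags)
block_ci = block_bootstrap_ci(snp_flags, block_len=5)
```

In an aggregation plot, intervals of this kind would be computed at each position along the element. The block variant is the natural choice when nearby elements or positions are not independent, since resampling single values in that case tends to understate the uncertainty.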