Department of Biological Sciences, Columbia University, New York, United States.
Department of Systems Biology, Columbia University, New York, United States.
Elife. 2021 Nov 22;10:e71513. doi: 10.7554/eLife.71513.
Whole exome sequences have now been collected for millions of humans, with the related goals of identifying pathogenic mutations in patients and establishing reference repositories of data from unaffected individuals. As a result, we are approaching an important limit, in which datasets are large enough that, in the absence of natural selection, every highly mutable site will have experienced at least one mutation in the genealogical history of the sample. Here, we focus on CpG sites that are methylated in the germline and experience mutations to T at an elevated rate of ~10⁻⁷ per site per generation; considering synonymous mutations in a sample of 390,000 individuals, ~99% of such CpG sites harbor a C/T polymorphism. Methylated CpG sites provide a natural mutation saturation experiment for fitness effects: as we show, at current sample sizes, not seeing a non-synonymous polymorphism is indicative of strong selection against that mutation. We rely on this idea to directly identify a subset of CpG transitions that are likely to be highly deleterious, including ~27% of possible loss-of-function mutations and up to 20% of possible missense mutations, depending on the type of functional site in which they occur. Unlike methylated CpGs, most mutation types, with rates on the order of 10⁻⁸ or 10⁻⁹ per site per generation, remain very far from saturation. We discuss what these findings imply for interpreting the potential clinical relevance of mutations from their presence or absence in reference databases and for inferences about the fitness effects of new mutations.
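The saturation argument can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch and not the authors' analysis. It assumes that neutral mutations fall on the sample genealogy as a Poisson process, that every site shares the same total branch length, and that the methylated CpG transition rate is about 1e-7 per site per generation with other mutation types at about 1e-8 or 1e-9. The function names are hypothetical. Under these assumptions, the ~99% of synonymous methylated CpG sites that are polymorphic fixes an implied total branch length, which can then be reused for the less mutable classes.

```python
import math


def implied_tree_length(frac_polymorphic: float, mu: float) -> float:
    """Total branch length L (in generations) solving 1 - exp(-mu * L) = frac.

    Assumption: mutations occur as a Poisson process with rate mu per
    generation along a genealogy of total length L, shared by all sites.
    """
    return -math.log(1.0 - frac_polymorphic) / mu


def expected_fraction_hit(mu: float, tree_length: float) -> float:
    """Expected fraction of neutral sites with at least one mutation."""
    return 1.0 - math.exp(-mu * tree_length)


# Assumed order-of-magnitude rate for methylated CpG transitions.
MU_CPG = 1e-7

# Calibrate L from the reported ~99% of polymorphic synonymous CpG sites.
L = implied_tree_length(0.99, MU_CPG)

for mu in (1e-7, 1e-8, 1e-9):
    print(f"mu = {mu:.0e}: expected fraction hit = {expected_fraction_hit(mu, L):.3f}")
```

Under these simplifying assumptions, a tenfold lower rate leaves roughly 37% of neutral sites with a mutation, and a hundredfold lower rate leaves under 5%, consistent with the qualitative statement that most mutation types remain far from saturation. Real genealogies vary across sites and mutation rates vary within each class, so the numbers are indicative only.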