Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
Department of Integrative Biology, University of Texas at Austin, Austin, TX 78712, USA.
G3 (Bethesda). 2023 Apr 11;13(4). doi: 10.1093/g3journal/jkad035.
Population genetics has adapted as technological advances in next-generation sequencing have resulted in an exponential increase in genetic data. A common approach to efficiently analyzing the genetic variation present in large sequencing data sets is the allele frequency spectrum (AFS), defined as the distribution of allele frequencies in a sample. While the frequency spectrum serves to summarize patterns of genetic variation, it implicitly treats mutation types (e.g. A→C vs C→T) as interchangeable. However, mutations of different types arise and spread under spatial and temporal variation in forces such as mutation rate and biased gene conversion, which results in heterogeneity in the distribution of allele frequencies across sites. In this work, we explore the impact of this simplification on multiple aspects of population genetic modeling. As a site's mutation rate is strongly affected by flanking nucleotides, we defined a mutation subtype by the base pair change and adjacent nucleotides (e.g. AAA→ATA) and systematically assessed the heterogeneity in the frequency spectrum across 96 distinct 3-mer mutation subtypes using n = 3556 whole-genome sequenced individuals of European ancestry. We observed substantial variation across the subtype-specific frequency spectra, with some of the variation being influenced by molecular factors previously identified for single base mutation types. Estimates of model parameters from demographic inference performed on each mutation subtype's AFS individually varied drastically across the 96 subtypes. In local patterns of variation, a combination of regional subtype composition and local genomic factors shaped the regional frequency spectrum across genomic regions. Our results illustrate how treating variants in large sequencing samples as interchangeable may confound population genetic frameworks and encourage us to consider the unique evolutionary mechanisms of analyzed polymorphisms.
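The two definitions in the abstract, the 3-mer mutation subtype and the subtype-specific frequency spectrum, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that the count of 96 comes from folding strand-complementary mutations together so that the reference center base is always A or C (consistent with the AAA→ATA example), and it assumes variants arrive as hypothetical `(ref_kmer, alt_base, alt_allele_count)` tuples. All function names are illustrative.

```python
from collections import Counter
from itertools import product

# Translation table for complementing bases (assumption: uppercase ACGT only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_subtype(ref_kmer, alt_base):
    """Label a 3-mer mutation, folded onto the strand whose center base is A or C.

    ref_kmer is the reference 3-mer (center base is the mutated site);
    alt_base is the alternate allele at the center position.
    """
    if ref_kmer[1] in "GT":
        # Same mutation seen from the opposite strand: flip both alleles.
        ref_kmer = reverse_complement(ref_kmer)
        alt_base = alt_base.translate(COMPLEMENT)
    alt_kmer = ref_kmer[0] + alt_base + ref_kmer[2]
    return f"{ref_kmer}>{alt_kmer}"


def all_subtypes():
    """Enumerate every distinct folded 3-mer mutation subtype."""
    subtypes = set()
    for left, center, right, alt in product("ACGT", repeat=4):
        if alt != center:
            subtypes.add(canonical_subtype(left + center + right, alt))
    return sorted(subtypes)


def subtype_afs(variants, n_chromosomes):
    """Build one allele frequency spectrum per mutation subtype.

    variants: iterable of (ref_kmer, alt_base, alt_allele_count) tuples
              (a hypothetical input layout chosen for this sketch).
    Returns {subtype: Counter mapping allele count -> number of sites},
    keeping only sites that are polymorphic in the sample.
    """
    spectra = {}
    for ref_kmer, alt_base, count in variants:
        if 0 < count < n_chromosomes:
            key = canonical_subtype(ref_kmer, alt_base)
            spectra.setdefault(key, Counter())[count] += 1
    return spectra


if __name__ == "__main__":
    print(len(all_subtypes()))  # number of folded subtypes
    toy = [("AAA", "T", 1), ("TTT", "A", 1), ("ACG", "T", 3)]
    print(subtype_afs(toy, n_chromosomes=10))
```

In this sketch, `TTT` with alternate allele `A` folds onto the same subtype as `AAA` with alternate allele `T`, so both toy singletons are counted in one spectrum. For a sample of n = 3556 diploid individuals, `n_chromosomes` would be 7112. Comparing the resulting per-subtype spectra, rather than pooling all sites into one, is the kind of heterogeneity the abstract describes.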