Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
Department of Statistics and Data Science, The Wharton School at the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
PLoS Genet. 2023 Jul 7;19(7):e1010807. doi: 10.1371/journal.pgen.1010807. eCollection 2023 Jul.
Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that the nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, these models become limited as the local sequence context window expands. The limitations include a lack of robustness to data sparsity at typical sample sizes, a lack of regularization to generate parsimonious models, and a lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context-based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, in a sparse data setting, examining the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning-inspired strategy for modeling germline mutations.
In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
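For readers unfamiliar with the sampling scheme named above, the following is a minimal illustrative sketch of adaptive Metropolis-within-Gibbs, not Baymer's implementation. It assumes a toy model in which each sequence context has an independent logit-scale polymorphism probability with a shared Normal prior and a binomial likelihood. It omits the hierarchical tree structure and the regularizing prior that the abstract describes. The function names, the prior parameters `mu` and `tau`, the batch size, and the 0.44 acceptance target (a common choice for one-dimensional random-walk updates) are all assumptions made for illustration.

```python
import math
import random


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_post(theta, k, n, mu, tau):
    # Binomial log-likelihood (k polymorphic sites out of n) on the logit
    # scale, plus an assumed Normal(mu, tau) prior on theta.
    log_lik = k * log_sigmoid(theta) + (n - k) * log_sigmoid(-theta)
    return log_lik - 0.5 * ((theta - mu) / tau) ** 2


def adaptive_mwg(counts, mu=-3.0, tau=2.0, n_iter=4000, batch=50, seed=0):
    """Sample per-context polymorphism probabilities.

    counts: list of (k, n) pairs, one per hypothetical sequence context.
    Returns the second half of the chain as lists of probabilities.
    """
    rng = random.Random(seed)
    d = len(counts)
    theta = [mu] * d       # current logit-scale state
    log_sd = [0.0] * d     # log proposal std. dev., one per coordinate
    accepted = [0] * d     # acceptances within the current batch
    samples = []
    for it in range(1, n_iter + 1):
        # One sweep: update each coordinate in turn with a random-walk
        # Metropolis step while the other coordinates are held fixed.
        for i, (k, n) in enumerate(counts):
            prop = theta[i] + rng.gauss(0.0, math.exp(log_sd[i]))
            log_ratio = (log_post(prop, k, n, mu, tau)
                         - log_post(theta[i], k, n, mu, tau))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta[i] = prop
                accepted[i] += 1
        # After each batch, nudge every proposal scale toward the target
        # acceptance rate, with a step size that does not grow over time.
        if it % batch == 0:
            delta = min(0.01, (it // batch) ** -0.5)
            for i in range(d):
                if accepted[i] / batch > 0.44:
                    log_sd[i] += delta
                else:
                    log_sd[i] -= delta
                accepted[i] = 0
        if it > n_iter // 2:  # discard the first half as burn-in
            samples.append([math.exp(log_sigmoid(t)) for t in theta])
    return samples


if __name__ == "__main__":
    # Hypothetical counts: (polymorphic sites, total sites) per context.
    toy_counts = [(30, 1000), (5, 100), (0, 20)]
    draws = adaptive_mwg(toy_counts)
    for i in range(len(toy_counts)):
        mean = sum(s[i] for s in draws) / len(draws)
        print(f"context {i}: posterior mean {mean:.4f}")
```

In this toy setup the sparse third context (0 of 20 sites) is pulled toward the prior rather than estimated as exactly zero, which gives a rough intuition for why a prior helps under data sparsity.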