Speed Maria Simonsen, Balding David Joseph, Hobolth Asger
Bioinformatics Research Centre, Aarhus University, 8000, Aarhus, Denmark.
Department of Affective Disorders, Aarhus University Hospital, 8000, Aarhus, Denmark.
J Math Biol. 2019 May;78(6):1727-1769. doi: 10.1007/s00285-018-01325-0. Epub 2019 Jan 28.
In population genetics, the Dirichlet (also called the Balding-Nichols) model has been considered for 20 years the key model for approximating the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is only approximate, because the Dirichlet model cannot accommodate positive correlations among alleles. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper addresses the problem by giving a general overview of how allele fraction data should be modeled under the most common multi-allelic mutational structures. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright-Fisher process with mutation and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is an exceptionally good approximation only for the pure drift, Jukes-Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required, and proposed, for the other mutation processes: a Beta-Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa-Kishino-Yano and Tamura-Nei processes. Finally, a novel Hierarchical Beta approximation, the Pyramidal Hierarchical Beta model, is developed for the generalized time-reversible and single-step mutation processes.
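The general idea of simulating allele fractions and checking them against Dirichlet moments can be illustrated with a minimal sketch. It is not the paper's method: it uses a discrete-generation Wright-Fisher simulation rather than the diffusion approximation, and the function names, parameter values and the particular moment check are assumptions chosen for illustration. The check relies on a standard property of the Dirichlet distribution with mean vector m and total concentration a0: its covariance matrix equals (diag(m) - m mᵀ) / (a0 + 1), so every off-diagonal covariance is negative and the element-wise ratio of the empirical covariance to diag(m) - m mᵀ should be roughly constant.

```python
import numpy as np


def parent_independent_mutation(stationary, rate):
    """Hypothetical helper: a K x K mutation matrix in which, with
    probability `rate`, an allele is replaced by a draw from `stationary`,
    regardless of the parental allele."""
    stationary = np.asarray(stationary, dtype=float)
    k = stationary.size
    return (1.0 - rate) * np.eye(k) + rate * np.tile(stationary, (k, 1))


def simulate_wright_fisher(p0, mutation, n_gene_copies, n_generations,
                           n_replicates, rng):
    """Discrete multi-allelic Wright-Fisher model with mutation.

    Each generation applies the mutation matrix to the current allele
    fractions and then resamples `n_gene_copies` copies multinomially
    (genetic drift). Returns an (n_replicates, K) array of fractions.
    """
    p = np.tile(np.asarray(p0, dtype=float), (n_replicates, 1))
    for _ in range(n_generations):
        expected = p @ mutation                       # mutation step
        expected /= expected.sum(axis=1, keepdims=True)  # guard rounding
        counts = np.array([rng.multinomial(n_gene_copies, row)
                           for row in expected])      # drift step
        p = counts / n_gene_copies
    return p


def dirichlet_moment_ratios(fractions):
    """Element-wise ratio of the empirical covariance to diag(m) - m m^T.

    Under a Dirichlet distribution every entry of the returned matrix
    equals 1 / (a0 + 1); a wide spread of entries, or a negative
    off-diagonal entry (a positive covariance), points away from it.
    """
    mean = fractions.mean(axis=0)
    cov = np.cov(fractions, rowvar=False)
    reference = np.diag(mean) - np.outer(mean, mean)
    return cov / reference


# Illustrative run with assumed parameter values.
rng = np.random.default_rng(0)
stationary = [0.25, 0.25, 0.25, 0.25]
mutation = parent_independent_mutation(stationary, rate=0.005)
fractions = simulate_wright_fisher(stationary, mutation, n_gene_copies=200,
                                   n_generations=100, n_replicates=200,
                                   rng=rng)
print(np.round(dirichlet_moment_ratios(fractions), 3))
```

Swapping in a different mutation matrix (for example one with unequal transition and transversion rates) and looking at how far the ratios drift apart gives a rough feel for why the appropriate model depends on the mutation process.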