Bhaskar Anand, Kamm John A, Song Yun S
University of California, Berkeley.
Adv Appl Probab. 2012 Jun;44(2):408-428. doi: 10.1239/aap/1339878718.
Many applications in genetic analysis use sampling distributions, which describe the probability of observing a sample of DNA sequences randomly drawn from a population. In the one-locus case with special models of mutation such as the infinite-alleles model or the finite-alleles parent-independent mutation model, closed-form sampling distributions under the coalescent have been known for many decades. However, no exact formula is currently known for more general models of mutation that are of biological interest. In this paper, models with finitely many alleles are considered, and an urn construction related to the coalescent is used to derive approximate closed-form sampling formulas for an arbitrary irreducible recurrent mutation model or for a reversible recurrent mutation model, depending on whether the number of distinct observed allele types is at most three or four, respectively. It is demonstrated empirically that the formulas derived here are highly accurate when the per-base mutation rate is low, which holds for many biological organisms.
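For orientation, the classical closed-form result for the finite-alleles parent-independent mutation (PIM) model mentioned above can be sketched in a few lines. This is the long-known Dirichlet-multinomial formula, not the approximate formulas derived in the paper. The function names and the parametrization are assumptions made for illustration: `theta_by_allele[i]` is taken to be the population-scaled rate of mutation to allele i, and the probability is for an unordered sample with the given allele counts.

```python
from math import factorial


def rising_factorial(x: float, n: int) -> float:
    """Return x * (x + 1) * ... * (x + n - 1); equals 1 when n == 0."""
    result = 1.0
    for k in range(n):
        result *= x + k
    return result


def pim_sampling_probability(counts, theta_by_allele):
    """Stationary probability of an unordered sample under the PIM model.

    counts[i] is the number of sampled copies of allele i.
    theta_by_allele[i] is the (assumed) scaled mutation rate to allele i.
    """
    n = sum(counts)
    theta = sum(theta_by_allele)

    # Multinomial coefficient: number of orderings of the sample.
    multinomial = factorial(n)
    for c in counts:
        multinomial //= factorial(c)

    # Product of rising factorials over allele types.
    numerator = 1.0
    for c, t in zip(counts, theta_by_allele):
        numerator *= rising_factorial(t, c)

    return multinomial * numerator / rising_factorial(theta, n)


if __name__ == "__main__":
    # Hypothetical rates for three alleles; sample of size 3 with counts (2, 1, 0).
    print(pim_sampling_probability((2, 1, 0), (0.2, 0.5, 0.3)))
```

A quick sanity check on such a sketch is that the probabilities of all count configurations for a fixed sample size sum to one, and that a sample of size one returns the stationary frequency of the sampled allele.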