Joyce Paul, Genz Alan, Buzbas Erkan Ozge
Department of Mathematics and Initiative for Bioinformatics and Evolutionary Studies, University of Idaho, Moscow, ID, USA.
J Comput Biol. 2012 Jun;19(6):650-61. doi: 10.1089/cmb.2012.0033.
Throughout the 1980s, Simon Tavaré made numerous significant contributions to population genetics theory. As genetic data, in particular DNA sequences, became more readily available, connecting population-genetic models to data became the central issue. The seminal work of Griffiths and Tavaré (1994a, 1994b, 1994c) was among the first to develop a likelihood method for estimating population-genetic parameters from full DNA sequences. We are now in the genomics era, in which methods need to scale up to handle massive data sets, and Tavaré has led the way to new approaches. However, performing statistical inference under non-neutral models has proved elusive. In tribute to Simon Tavaré, we present an article in the spirit of his work that provides a computationally tractable method for simulating and analyzing data under a class of non-neutral population-genetic models. Computational methods for approximating likelihood functions and generating samples under a class of allele-frequency-based, non-neutral, parent-independent mutation models were proposed by Donnelly, Nordborg, and Joyce (DNJ) (Donnelly et al., 2001). DNJ (2001) simulated samples of allele frequencies from non-neutral models using neutral models as the auxiliary distribution in a rejection algorithm. However, the patterns of allele frequencies produced by neutral models are dissimilar to those produced by non-neutral models, which makes the rejection method inefficient. For example, in some cases the methods in DNJ (2001) require 10^9 rejections before a sample from the non-neutral model is accepted. Our method simulates samples directly from the distribution of non-neutral models, making simulation a practical tool for studying the behavior of the likelihood and for performing inference on the strength of selection.
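To make the inefficiency of the rejection approach concrete, the following is a minimal sketch of a rejection sampler that uses a neutral model as the auxiliary distribution. It is not the DNJ (2001) implementation. The abstract does not specify the selection scheme, so the sketch assumes, purely for illustration, a symmetric Dirichlet(theta/k, ..., theta/k) neutral distribution over k allele frequencies and a non-neutral density proportional to the neutral density times exp(-sigma * F), where F is the sum of squared frequencies. The function names and the parameters k, theta, and sigma are hypothetical.

```python
import math
import random


def sample_neutral(k, theta, rng):
    """Draw k allele frequencies from a symmetric Dirichlet(theta/k, ..., theta/k).

    Assumption: this stands in for the neutral auxiliary distribution.
    """
    gammas = [rng.gammavariate(theta / k, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]


def homozygosity(freqs):
    """Sum of squared allele frequencies, F."""
    return sum(x * x for x in freqs)


def rejection_sample(k, theta, sigma, rng, max_tries=1_000_000):
    """Return (frequencies, number_of_rejections).

    Assumed target: neutral density times exp(-sigma * F), up to a constant.
    Because F >= 1/k, the weight is bounded by exp(-sigma / k), so the
    acceptance probability exp(-sigma * (F - 1/k)) never exceeds 1.
    """
    floor = 1.0 / k  # smallest possible homozygosity with k alleles
    for rejections in range(max_tries):
        x = sample_neutral(k, theta, rng)
        accept_prob = math.exp(-sigma * (homozygosity(x) - floor))
        if rng.random() < accept_prob:
            return x, rejections
    raise RuntimeError("no sample accepted within max_tries")


# Illustrative run: stronger selection (larger sigma) tends to cost more rejections.
rng = random.Random(0)
for sigma in (0.0, 5.0, 20.0):
    counts = [rejection_sample(4, 2.0, sigma, rng)[1] for _ in range(200)]
    print(sigma, sum(counts) / len(counts))
```

Under these assumptions the expected number of rejections grows roughly exponentially with sigma, because neutral draws rarely land where the non-neutral density concentrates. This is the mismatch that the abstract identifies as the reason for sampling directly from the non-neutral distribution.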