Ki Caleb, Terhorst Jonathan
Department of Statistics, University of Michigan.
J Am Stat Assoc. 2024;119(547):2242-2255. doi: 10.1080/01621459.2023.2252570. Epub 2023 Oct 3.
In statistical genetics, the sequentially Markov coalescent (SMC) is an important family of models for approximating the distribution of genetic variation data under complex evolutionary models. Methods based on SMC are widely used in genetics and evolutionary biology, with significant applications to genotype phasing and imputation, recombination rate estimation, and inference of population history. SMC allows for likelihood-based inference using hidden Markov models (HMMs), where the latent variable represents a genealogy. Because genealogies are continuous, while HMMs are discrete, SMC requires discretizing the space of trees in a way that is awkward and creates bias. In this work, we propose a method that circumvents this requirement, enabling SMC-based inference to be performed in the natural setting of a continuous state space. We derive fast, exact procedures for frequentist and Bayesian inference using SMC. Compared to existing methods, ours requires minimal user intervention or parameter tuning, needs no numerical optimization or expectation-maximization (EM), and is faster and more accurate.
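As a rough illustration of the discretization bias mentioned in the abstract (not the authors' method), the sketch below assumes the simplest textbook setting: a pairwise coalescence time T that is Exp(1)-distributed in coalescent units, and a per-site probability of no mutation equal to exp(-theta * T). The time axis is cut into equal-probability bins, each bin is replaced by its conditional mean, and the binned value of E[exp(-theta * T)] is compared with the exact value 1 / (1 + theta). All function names and the choice of theta are hypothetical and chosen only for this example.

```python
import math


def quantile_breakpoints(num_bins):
    """Bin edges that split an Exp(1) variable into equal-probability bins."""
    edges = [-math.log1p(-k / num_bins) for k in range(num_bins)]
    return edges + [math.inf]


def bin_mean(a, b):
    """Conditional mean E[T | a <= T < b] for T ~ Exp(1)."""
    if math.isinf(b):
        # Memoryless property: the tail beyond a has mean a + 1.
        return a + 1.0
    ea, eb = math.exp(-a), math.exp(-b)
    return ((a + 1.0) * ea - (b + 1.0) * eb) / (ea - eb)


def discretized_no_mutation_prob(theta, num_bins):
    """Approximate E[exp(-theta * T)] with one representative time per bin."""
    edges = quantile_breakpoints(num_bins)
    reps = [bin_mean(a, b) for a, b in zip(edges[:-1], edges[1:])]
    # Every bin has probability 1 / num_bins by construction.
    return sum(math.exp(-theta * t) for t in reps) / num_bins


def exact_no_mutation_prob(theta):
    """Exact E[exp(-theta * T)] for T ~ Exp(1), i.e. its Laplace transform."""
    return 1.0 / (1.0 + theta)


if __name__ == "__main__":
    theta = 2.0  # assumed scaled mutation rate, for illustration only
    exact = exact_no_mutation_prob(theta)
    for num_bins in (4, 16, 64):
        approx = discretized_no_mutation_prob(theta, num_bins)
        print(num_bins, approx, exact - approx)
```

Because exp(-theta * t) is convex in t, replacing each bin by its mean always underestimates the exact probability (Jensen's inequality). The gap shrinks as the number of bins grows but does not vanish for any finite grid, which is the kind of bias a continuous-state formulation avoids.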