Cappello Lorenzo, Palacios Julia A
Stanford University.
Ann Appl Stat. 2020 Jun;14(2):727-751. doi: 10.1214/19-AOAS1313.
Statistical inference of evolutionary parameters from molecular sequence data relies on coalescent models to account for the shared genealogical ancestry of the samples. However, inferential algorithms do not scale to available data sets. A strategy to improve computational efficiency is to rely on simpler coalescent and mutation models, resulting in smaller hidden state spaces. An estimate of the cardinality of the state space of genealogical trees at different resolutions is essential to decide the best modeling strategy for a given data set. To our knowledge, there is neither an exact nor an approximate method to determine these cardinalities. We propose a sequential importance sampling algorithm to estimate the cardinality of the sample space of genealogical trees under different coalescent resolutions. Our sampling scheme proceeds sequentially across the set of combinatorial constraints imposed by the data, which in this work are completely linked sequences of DNA at a nonrecombining segment. We analyze the cardinality of different genealogical tree spaces on simulations to study the settings that favor coarser resolutions. We apply our method to estimate the cardinality of genealogical tree spaces from mtDNA data from the 1000 Genomes Project and from a sample from a Melanesian population at the β-globin locus.
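To make the idea of counting by sequential importance sampling concrete, the following is a minimal Python sketch. It is not the algorithm proposed in the paper. It assumes a toy constraint (one required clade of leaves must remain nested in, disjoint from, or contained by every lineage) instead of the constraints imposed by real sequence data, and it uses a uniform proposal over allowed merges, which gives a Knuth-style unbiased estimator. All names (`compatible`, `allowed_merges`, `sample_weight`, `estimate_count`, `exact_count`) are hypothetical and introduced only for illustration.

```python
import random
from itertools import combinations


def compatible(lineage, clade):
    """Toy constraint: a lineage must be nested in, contain, or be disjoint from the clade."""
    return lineage <= clade or clade <= lineage or not (lineage & clade)


def allowed_merges(lineages, clade):
    """All pairs of current lineages whose union respects the toy constraint."""
    return [(a, b) for a, b in combinations(lineages, 2) if compatible(a | b, clade)]


def sample_weight(n, clade, rng):
    """Draw one ranked labeled history by uniform choices among allowed merges.

    The returned weight is the product of the number of allowed merges at each
    step, i.e. the inverse of the probability of the sampled history.
    """
    lineages = [frozenset([i]) for i in range(n)]
    weight = 1
    while len(lineages) > 1:
        moves = allowed_merges(lineages, clade)
        weight *= len(moves)
        a, b = rng.choice(moves)
        lineages = [x for x in lineages if x not in (a, b)] + [a | b]
    return weight


def estimate_count(n, clade, num_samples, seed=0):
    """Average of importance weights: an unbiased estimate of the number of histories."""
    rng = random.Random(seed)
    return sum(sample_weight(n, clade, rng) for _ in range(num_samples)) / num_samples


def exact_count(lineages, clade):
    """Brute-force enumeration, feasible only for very small n (used as a check)."""
    if len(lineages) == 1:
        return 1
    total = 0
    for a, b in allowed_merges(lineages, clade):
        rest = [x for x in lineages if x not in (a, b)]
        total += exact_count(rest + [a | b], clade)
    return total


if __name__ == "__main__":
    n, clade = 6, frozenset({0, 1, 2})
    leaves = [frozenset([i]) for i in range(n)]
    print("estimate:", estimate_count(n, clade, num_samples=20000))
    print("exact:   ", exact_count(leaves, clade))
```

Each sampled history is weighted by the inverse of its proposal probability, so the average weight estimates the size of the constrained tree space. In this toy setting every partial history can be completed, so no sample has zero weight. A realistic implementation would need proposals adapted to the data constraints and to the chosen coalescent resolution.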