Department of Zoology, University of Oxford, Oxford OX1 3SY, UK.
Syst Biol. 2019 Sep 1;68(5):730-743. doi: 10.1093/sysbio/syz008.
The coalescent process describes how changes in the size or structure of a population influence the genealogical patterns of sequences sampled from that population. The estimation of (effective) population size changes from genealogies that are reconstructed from these sampled sequences is an important problem in many biological fields. Often, population size is characterized by a piecewise-constant function, with each piece serving as a population size parameter to be estimated. Estimation quality depends on both the statistical coalescent inference method employed and the experimental protocol, which controls variables such as the sampling of sequences through time and space, or the transformation of model parameters. While there is an extensive literature on coalescent inference methodology, there is comparatively little work on experimental design. The research that does exist is largely simulation-based, precluding the development of provable or general design theorems. We examine three key design problems: temporal sampling of sequences under the skyline demographic coalescent model, spatio-temporal sampling under the structured coalescent model, and time discretization for sequentially Markovian coalescent models. In all cases, we prove that 1) working in the logarithm of the parameters to be inferred (e.g., population size) and 2) distributing informative coalescent events uniformly among these log-parameters is uniquely robust. "Robust" means that the total and maximum uncertainty of our parameter estimates are minimized, and made insensitive to their unknown (true) values. This robust design theorem provides rigorous justification for several existing coalescent experimental design decisions and leads to usable guidelines for future empirical or simulation-based investigations. Because it holds across all three models, this theorem may form the basis of an experimental design paradigm for coalescent inference.
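The two design rules in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes the standard result for a piecewise-constant Kingman coalescent that the Fisher information about a segment's population size N_j is m_j / N_j^2, where m_j is the number of coalescent events in that segment, so the information about log N_j is simply m_j. The function names and the brute-force search are hypothetical and chosen only for illustration.

```python
from itertools import product


def crlb_log_sizes(event_counts):
    """Variance lower bound for each log N_j, assuming Fisher information m_j.

    Under this assumption the bound 1 / m_j does not depend on N_j.
    """
    return [1.0 / m for m in event_counts]


def crlb_sizes(event_counts, sizes):
    """Variance lower bound for each untransformed N_j: N_j**2 / m_j.

    Here the bound grows with the unknown true value N_j.
    """
    return [n * n / m for m, n in zip(event_counts, sizes)]


def allocations(total_events, segments):
    """Yield every split of total_events into `segments` positive counts."""
    for combo in product(range(1, total_events + 1), repeat=segments):
        if sum(combo) == total_events:
            yield combo


def best_allocation(total_events, segments, criterion):
    """Brute-force the allocation minimizing a criterion on the log-scale bounds."""
    return min(
        allocations(total_events, segments),
        key=lambda counts: criterion(crlb_log_sizes(counts)),
    )


# 12 coalescent events spread over 3 population-size segments (toy numbers).
best_total = best_allocation(12, 3, sum)  # minimizes total uncertainty
best_worst = best_allocation(12, 3, max)  # minimizes maximum uncertainty
print(best_total, best_worst)
```

In this toy setting both criteria pick the uniform split of four events per segment, and the log-scale bounds are the same whatever the true sizes are, whereas `crlb_sizes([4, 4], [10, 100])` differs by a factor of 100 between the two segments.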