Department of Physics, University of Chicago, Chicago, Illinois, United States of America.
Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, United States of America.
PLoS Comput Biol. 2022 Sep 16;18(9):e1010419. doi: 10.1371/journal.pcbi.1010419. eCollection 2022 Sep.
Unraveling the complex demographic histories of natural populations is a central problem in population genetics. Understanding past demographic events is of general anthropological interest, and it is also an important step in establishing accurate null models when identifying adaptive or disease-associated genetic variation. Coalescent Hidden Markov Models (CHMMs) are an important class of tools for inferring past population size changes from genomic sequence data. These models make efficient use of the linkage information in population genomic datasets by treating the local genealogies relating sampled individuals as latent states that evolve along the chromosome in an HMM framework. Extending these models to large sample sizes is challenging, since the number of possible latent states increases rapidly with sample size. Here, we present CHIMP (CHMM History-Inference Maximum-Likelihood Procedure), a novel CHMM method for inferring the size history of a population. It can be applied to large samples (hundreds of haplotypes) and requires only unphased genomes as input. The two implementations of CHIMP that we present here use either the height of the genealogical tree (TMRCA) or the total branch length, respectively, as the latent variable at each position in the genome. The requisite transition and emission probabilities are obtained by numerically solving systems of differential equations derived from the ancestral process with recombination. The parameters of the population size history are subsequently inferred using an Expectation-Maximization algorithm. In addition, we implement a composite likelihood scheme that allows the method to scale to large sample sizes. We demonstrate the efficiency and accuracy of our method in a variety of benchmark tests using simulated data and present comparisons to other state-of-the-art methods.
Specifically, our implementation using TMRCA as the latent variable performs comparably to these methods and provides accurate estimates of effective population sizes in intermediate and ancient times. Our method is agnostic to the phasing of the data, which makes it a promising alternative in scenarios where high-quality data are not available, and it has potential applications for pseudo-haploid data.
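To make the HMM framing concrete, the following is a minimal sketch of a likelihood computation with a discretized TMRCA as the latent state. It is not the CHIMP implementation. It assumes a sample of two haplotypes, a constant population size, and time in coalescent units, and it replaces the differential-equation-derived transition probabilities described above with a toy rule: after a recombination, the new TMRCA is redrawn from its stationary distribution. The function names and the parameters `theta` and `rho` (per-site scaled mutation and recombination rates) are hypothetical.

```python
import math


def make_toy_chmm(boundaries, theta, rho):
    """Build (initial, transition, emission) for a toy discretized-TMRCA HMM.

    boundaries: increasing interval edges in coalescent units, starting at 0.0;
                the last interval extends to infinity.
    theta, rho: assumed per-site scaled mutation and recombination rates.
    """
    k = len(boundaries)
    edges = list(boundaries) + [math.inf]

    # Stationary probability of each interval when TMRCA ~ Exp(1)
    # (two haplotypes, constant size: an assumption of this sketch).
    initial = [math.exp(-edges[i]) - math.exp(-edges[i + 1]) for i in range(k)]

    # Representative time per interval: midpoint, or edge + 1 for the open tail.
    times = [
        0.5 * (edges[i] + edges[i + 1]) if math.isfinite(edges[i + 1]) else edges[i] + 1.0
        for i in range(k)
    ]

    # Toy transition rule (NOT the ODE-derived one): with probability r_i a
    # recombination occurs and the state is redrawn from `initial`.
    transition = []
    for i in range(k):
        r_i = 1.0 - math.exp(-rho * times[i])
        row = [r_i * initial[j] + ((1.0 - r_i) if i == j else 0.0) for j in range(k)]
        transition.append(row)

    # Emission: observation 0 = no difference at the site, 1 = difference.
    emission = [
        (math.exp(-theta * t), 1.0 - math.exp(-theta * t)) for t in times
    ]
    return initial, transition, emission


def forward_loglik(obs, initial, transition, emission):
    """Scaled forward algorithm; returns log P(obs) under the toy model."""
    k = len(initial)
    alpha = [initial[i] * emission[i][obs[0]] for i in range(k)]
    loglik = 0.0
    for pos in range(len(obs)):
        if pos > 0:
            alpha = [
                sum(alpha[i] * transition[i][j] for i in range(k)) * emission[j][obs[pos]]
                for j in range(k)
            ]
        scale = sum(alpha)  # rescale to avoid underflow on long sequences
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik
```

In an Expectation-Maximization setting, a forward pass like this one would be paired with a backward pass to obtain posterior state probabilities, and the size-history parameters would enter through the initial, transition, and emission probabilities; that coupling is omitted here.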