Department of Agriculture and Food Systems, Melbourne School of Land and Environment, University of Melbourne, Victoria, Australia.
Mol Biol Evol. 2013 Sep;30(9):2209-23. doi: 10.1093/molbev/mst125. Epub 2013 Jul 10.
Whole-genome sequence is potentially the richest source of genetic data for inferring ancestral demography. However, full sequence also presents significant challenges: fully utilizing such large data sets and ensuring that sequencing errors do not introduce bias into the inferred demography. Using whole-genome sequence data from two Holstein cattle, we demonstrate a new method to correct for bias caused by hidden errors and then infer stepwise changes in ancestral demography up to the present. Without the correction, estimates of recent effective population size (Ne) showed a strong upward bias, both for our method and for the pairwise sequentially Markovian coalescent method of Li and Durbin (Inference of human population history from individual whole-genome sequences. Nature 475:493-496). To infer demography, we use an analytical predictor of multiloci linkage disequilibrium (LD) based on a simple coalescent model that allows for changes in Ne. The LD statistic summarizes the distribution of runs of homozygosity for any given demography. We infer the best-fit demography as the one whose predicted distribution of runs of homozygosity matches the distribution observed in the corrected sequence data. We use multiloci LD because it potentially holds more information about ancestral demography than pairwise LD. The inferred demography indicates a strong reduction in Ne around 170,000 years ago, possibly related to the divergence of African and European Bos taurus cattle. This is followed by a further reduction, to an Ne of between 3,500 and 6,000, coinciding with the period of cattle domestication. The most recent reduction of Ne, to approximately 100 in the Holstein breed, agrees well with estimates from pedigrees. Our approach can be applied to whole-genome sequence from any diploid species and can be scaled up to use sequence from multiple individuals.
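The observed quantity that the abstract compares against the model, the distribution of runs of homozygosity, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that heterozygous site positions for one diploid individual are already available as 1-based coordinates within a single region, that a run is the stretch of bases between adjacent heterozygous sites (or between a region boundary and the nearest heterozygous site), and that runs are tallied into user-chosen length bins. The function names and the bin edges are hypothetical, and the error correction and the LD predictor described in the abstract are not reproduced here.

```python
from bisect import bisect_right


def runs_of_homozygosity(het_positions, start, end):
    """Return run lengths (bp) between heterozygous sites in [start, end].

    Assumption (illustrative only): a run is every base strictly between two
    adjacent heterozygous sites, or between a region boundary and the nearest
    heterozygous site. Zero-length runs (adjacent heterozygous sites) are skipped.
    """
    runs = []
    prev = start - 1  # sentinel just before the region
    for pos in sorted(het_positions):
        if not start <= pos <= end:
            raise ValueError(f"position {pos} lies outside [{start}, {end}]")
        length = pos - prev - 1
        if length > 0:
            runs.append(length)
        prev = pos
    tail = end - prev  # bases after the last heterozygous site
    if tail > 0:
        runs.append(tail)
    return runs


def run_length_histogram(runs, bin_edges):
    """Count runs per half-open bin [bin_edges[i], bin_edges[i + 1]).

    Runs shorter than the first edge or at least as long as the last edge
    are dropped.
    """
    counts = [0] * (len(bin_edges) - 1)
    for run in runs:
        i = bisect_right(bin_edges, run) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


# Toy example: a 100 bp region with heterozygous sites at 10, 11 and 50.
toy_runs = runs_of_homozygosity([10, 11, 50], start=1, end=100)
toy_hist = run_length_histogram(toy_runs, bin_edges=[1, 10, 40, 100])
```

In this toy case the three runs cover positions 1-9, 12-49 and 51-100. In a fitting procedure of the kind the abstract describes, a histogram like `toy_hist` would be the observed summary that candidate demographies are scored against.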