Terhorst Jonathan, Kamm John A, Song Yun S
Department of Statistics, University of California, Berkeley, Berkeley, California, USA.
Computer Science Division, University of California, Berkeley, Berkeley, California, USA.
Nat Genet. 2017 Feb;49(2):303-309. doi: 10.1038/ng.3748. Epub 2016 Dec 26.
It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.
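The contrast between phased and unphased input can be illustrated with a toy example. The sketch below is not SMC++'s algorithm or API. It is a minimal illustration, with made-up haplotypes and hypothetical helper names (`unphased_genotypes`, `apply_switch_errors`), of why a summary computed from unphased genotypes cannot be affected by phasing switch errors: swapping the two haplotypes of an individual over any stretch of sites leaves the per-site allele counts unchanged.

```python
def unphased_genotypes(hap_a, hap_b):
    """Per-site derived-allele count (0, 1, or 2); carries no phase information."""
    return [a + b for a, b in zip(hap_a, hap_b)]


def apply_switch_errors(hap_a, hap_b, switch_points):
    """Toy model of switch errors: at each switch point, the two haplotype
    labels are exchanged for all following sites until the next switch point."""
    points = set(switch_points)
    out_a, out_b = [], []
    swapped = False
    for i, (a, b) in enumerate(zip(hap_a, hap_b)):
        if i in points:
            swapped = not swapped
        if swapped:
            a, b = b, a
        out_a.append(a)
        out_b.append(b)
    return out_a, out_b


# Made-up haplotypes for one diploid individual (0 = ancestral, 1 = derived).
hap_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
hap_b = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]

# Introduce two switch errors, so sites 3..6 have their labels exchanged.
bad_a, bad_b = apply_switch_errors(hap_a, hap_b, switch_points=[3, 7])

# The phased haplotypes differ, but the unphased genotypes are identical.
haplotypes_changed = (bad_a, bad_b) != (hap_a, hap_b)
genotypes_unchanged = unphased_genotypes(hap_a, hap_b) == unphased_genotypes(bad_a, bad_b)
print(haplotypes_changed, genotypes_unchanged)
```

A method that conditions on the haplotypes themselves would see different input before and after the switch errors, whereas one that uses only the genotype vector sees the same data in both cases.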