Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK.
Nat Genet. 2019 Sep;51(9):1330-1338. doi: 10.1038/s41588-019-0483-y. Epub 2019 Sep 2.
Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.