Débora Y C Brandt, Xinzhu Wei, Yun Deng, Andrew H Vaughn, Rasmus Nielsen
Department of Integrative Biology, University of California Berkeley, Berkeley, CA 94720, USA.
Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA.
Genetics. 2022 May 5;221(1). doi: 10.1093/genetics/iyac044.
The ancestral recombination graph is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress toward scalably estimating whole-genome genealogies. In addition to inferring the ancestral recombination graph, some of these methods can also provide ancestral recombination graphs sampled from a defined posterior distribution. Obtaining good samples of ancestral recombination graphs is crucial for quantifying statistical uncertainty and for estimating population genetic parameters such as effective population size, mutation rate, and allele age. Here, we use standard neutral coalescent simulations to benchmark the estimates of pairwise coalescence times from 3 popular ancestral recombination graph inference programs: ARGweaver, Relate, and tsinfer+tsdate. We compare (1) the true coalescence times to the inferred times at each locus and (2) the distribution of coalescence times across all loci to the expected exponential distribution, and we test (3) whether the sampled coalescence times have the properties expected of a valid posterior distribution. We find that inferred coalescence times at each locus are most accurate with ARGweaver, and often more accurate with Relate than with tsinfer+tsdate. However, all 3 methods tend to overestimate small coalescence times and underestimate large ones. Lastly, the posterior distribution of ARGweaver is closer to the expected posterior distribution than that of Relate, but this higher accuracy comes at a substantial cost in scalability. The best choice of method will depend on the number and length of input sequences and on the goal of downstream analyses, and we provide guidelines for best practices.
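Comparison (2) in the abstract can be illustrated with a minimal sketch. It is not the authors' benchmarking code, and it does not call ARGweaver, Relate, or tsinfer+tsdate. It assumes a constant-size diploid population, so that pairwise coalescence times are exponentially distributed with a mean of 2N generations. The value N = 10,000, the function names, and the "shrinking" estimator are all hypothetical stand-ins; the shrinking estimator only mimics the reported tendency to overestimate small times and underestimate large ones. The sketch measures how far a set of times departs from the expected exponential distribution using a Kolmogorov-Smirnov distance.

```python
import math
import random


def exponential_cdf(t, mean):
    """CDF of an exponential distribution with the given mean."""
    return 1.0 - math.exp(-t / mean)


def ks_statistic(times, mean):
    """Kolmogorov-Smirnov distance between the empirical CDF of `times`
    and the exponential CDF with the given mean."""
    ordered = sorted(times)
    n = len(ordered)
    worst = 0.0
    for i, t in enumerate(ordered):
        expected = exponential_cdf(t, mean)
        # The empirical CDF jumps from i/n to (i+1)/n at t; check both sides.
        worst = max(worst, abs(expected - i / n), abs((i + 1) / n - expected))
    return worst


def simulate_pairwise_times(n_loci, mean, rng):
    """Draw one pairwise coalescence time per locus (assumed independent)."""
    return [rng.expovariate(1.0 / mean) for _ in range(n_loci)]


def shrink_toward_mean(times, mean, weight):
    """Hypothetical biased estimator: pulls every time toward the mean,
    so small times are overestimated and large times underestimated."""
    return [(1.0 - weight) * t + weight * mean for t in times]


if __name__ == "__main__":
    rng = random.Random(42)
    n_diploid = 10_000               # assumed effective population size
    mean_time = 2.0 * n_diploid      # expected pairwise time in generations

    true_times = simulate_pairwise_times(5_000, mean_time, rng)
    biased_times = shrink_toward_mean(true_times, mean_time, weight=0.5)

    print("KS distance, true times:  ", ks_statistic(true_times, mean_time))
    print("KS distance, shrunk times:", ks_statistic(biased_times, mean_time))
```

Under these assumptions the simulated times sit close to the expected exponential distribution, while the shrunk times show a much larger distance because the estimator removes both tails. Real inferred times would replace `biased_times` in this comparison.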