Evaluating ARG-estimation methods in the context of estimating population-mean polygenic score histories.

Author information

Peng Dandan, Mulder Obadiah J, Edge Michael D

Affiliation

Department of Quantitative and Computational Biology, University of Southern California.

Publication information

bioRxiv. 2024 Dec 20:2024.05.24.595829. doi: 10.1101/2024.05.24.595829.

DOI:10.1101/2024.05.24.595829

PMID:38854009

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11160635/

Abstract

Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ARG may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ancestral recombination graph (ARG). Here we examine the performance in simulation of seven ARG estimation methods: ARGweaver, RENT+, Relate, tsinfer+tsdate, ARG-Needle, ASMC-clust, and SINGER, using their estimated coalescent trees and examining bias, mean squared error (MSE), confidence interval coverage, and Type I and II error rates of the downstream methods. Although it does not scale to the sample sizes attainable by other new methods, SINGER produced the most accurate estimated PGS histories in many instances, even when Relate, tsinfer+tsdate, ARG-Needle and ASMC-clust used samples ten or more times as large as those used by SINGER. In general, the best choice of method depends on the number of samples available and the historical time period of interest. In particular, the unprecedented sample sizes allowed by Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust are of greatest importance when the recent past is of interest; further back in time, most of the tree has coalesced, and differences in contemporary sample size are less salient.
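The evaluation criteria named in the abstract (bias, MSE, and confidence interval coverage) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `summarize_estimates`, the array layout (simulation replicates by time points), and the normal-approximation interval with `z = 1.96` are all assumptions made for illustration, and the toy data are randomly generated rather than taken from the study.

```python
import numpy as np


def summarize_estimates(estimates, std_errors, truth, z=1.96):
    """Per-time-point bias, MSE, and CI coverage across simulation replicates.

    Hypothetical helper, not from the paper.
    estimates, std_errors: arrays of shape (n_replicates, n_timepoints)
    truth: array of shape (n_timepoints,), the true population-mean PGS
    z: critical value for an assumed normal-approximation interval
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    truth = np.asarray(truth, dtype=float)

    errors = estimates - truth                    # broadcast over replicates
    bias = errors.mean(axis=0)                    # mean signed error
    mse = (errors ** 2).mean(axis=0)              # mean squared error
    covered = np.abs(errors) <= z * std_errors    # truth inside the interval?
    coverage = covered.mean(axis=0)               # fraction of replicates covered
    return {"bias": bias, "mse": mse, "coverage": coverage}


# Toy data (assumed): unbiased estimates with a correctly specified standard error.
rng = np.random.default_rng(0)
truth = np.array([0.0, 0.5, 1.0])
std_errors = np.full((2000, 3), 0.2)
estimates = truth + rng.normal(0.0, 0.2, size=(2000, 3))
summary = summarize_estimates(estimates, std_errors, truth)
```

In this toy setting the bias should be near zero, the MSE near the sampling variance (0.04), and the coverage near the nominal 95%. Departures from those values in a real benchmark would indicate biased estimates or miscalibrated intervals at particular time points.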
