Department of Mathematics, Statistics, and Computer Science, St. Olaf College, Northfield, MN 55057, USA
Pac Symp Biocomput. 2020;25:719-730.
The popularization of biobanks provides an unprecedented amount of genetic and phenotypic information that can be used to research the relationship between genetics and human health. Despite the opportunities these datasets provide, they also pose many problems related to computational time and cost, data size and transfer, and privacy and security. Publishing summary statistics from these biobanks, and using them in a variety of downstream statistical analyses, alleviates many of these logistical problems. However, major questions remain about how to use summary statistics in all but the simplest downstream applications. Here, we present a novel approach that uses basic summary statistics (estimates from single-marker regressions on single phenotypes) to evaluate more complex phenotypes using multivariate methods. In particular, we present a covariate-adjusted method for conducting principal component analysis (PCA) using only biobank summary statistics. Through simulation, we validate exact formulas for this method and provide a framework for estimation when specific summary statistics are not available. We apply our method to a real data set of fatty acid and genomic data.
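To make the idea of running PCA from summary statistics concrete, the sketch below illustrates only the simplest underlying identity, and it is not the covariate-adjusted method itself. Because ordinary least squares is linear in the response, the slope of a principal component score on a marker equals the same linear combination of the single-phenotype slopes. The function names (`single_marker_slopes`, `pc_slope_from_summary`), the simulated data, and the assumption that the phenotype covariance matrix is available are all illustrative choices, not taken from the paper.

```python
import numpy as np

def single_marker_slopes(g, Y):
    """OLS slope (with intercept) of each phenotype column of Y on genotype g."""
    gc = g - g.mean()
    Yc = Y - Y.mean(axis=0)
    return gc @ Yc / (gc @ gc)

def pc_slope_from_summary(betas, pheno_cov, k=0):
    """Slope of the k-th PC score on the marker, using summary statistics only.

    Assumes the phenotype covariance matrix is available (hypothetical input).
    """
    eigvals, eigvecs = np.linalg.eigh(pheno_cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # reorder to descending variance
    w = eigvecs[:, order[k]]                       # loadings of the k-th component
    return w @ betas, w

# Simulated individual-level data, used only to check the identity.
rng = np.random.default_rng(0)
n, m = 500, 4
g = rng.binomial(2, 0.3, size=n).astype(float)     # additive genotype coding 0/1/2
Y = rng.normal(size=(n, m)) + np.outer(g, [0.5, 0.2, 0.0, -0.3])

# "Summary statistics": per-phenotype slopes and the phenotype covariance.
betas = single_marker_slopes(g, Y)
pheno_cov = np.cov(Y, rowvar=False)
slope_summary, w = pc_slope_from_summary(betas, pheno_cov)

# Direct analysis on individual-level data for comparison.
pc_score = (Y - Y.mean(axis=0)) @ w
slope_direct = single_marker_slopes(g, pc_score[:, None])[0]

print(np.isclose(slope_summary, slope_direct))
```

The comparison at the end uses individual-level data only as a check. In a summary-statistics setting only `betas` and `pheno_cov` would be needed for the point estimate. Standard errors and covariate adjustment need more than this sketch covers.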