Crosslin David R, Tromp Gerard, Burt Amber, Kim Daniel S, Verma Shefali S, Lucas Anastasia M, Bradford Yuki, Crawford Dana C, Armasu Sebastian M, Heit John A, Hayes M Geoffrey, Kuivaniemi Helena, Ritchie Marylyn D, Jarvik Gail P, de Andrade Mariza
Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA; Department of Genome Sciences, University of Washington, Seattle, WA, USA.
The Sigfried and Janet Weis Center for Research, Geisinger Health System, Danville, PA, USA.
Front Genet. 2014 Nov 4;5:352. doi: 10.3389/fgene.2014.00352. eCollection 2014.
Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) is a common way to address the effect of population stratification. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA that derives components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controls for site and platform bias, in addition to ancestry differences, and has the advantage of using fewer covariates and degrees of freedom.
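The reference-based approach described in the abstract can be illustrated with a minimal sketch. It assumes that loadings are estimated by a singular value decomposition of a standardized reference genotype matrix and that study subjects are then projected onto those loadings. The function names (`fit_reference_loadings`, `project_subjects`), the standardization by reference mean and standard deviation, and the simulated genotypes are illustrative assumptions, not the authors' published protocol.

```python
import numpy as np


def fit_reference_loadings(ref_genotypes, n_components=2):
    """Estimate variant loadings from a reference sample (hypothetical helper).

    ref_genotypes: array of shape (n_ref_subjects, n_variants), allele dosages 0/1/2.
    """
    mean = ref_genotypes.mean(axis=0)
    std = ref_genotypes.std(axis=0)
    std[std == 0] = 1.0  # guard against monomorphic variants
    z = (ref_genotypes - mean) / std
    # Rows of vt are the principal axes (variant loadings) of the reference sample.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[:n_components].T  # shape (n_variants, n_components)
    return mean, std, loadings


def project_subjects(genotypes, mean, std, loadings):
    """Project subjects onto reference loadings, reusing the reference scaling."""
    return ((genotypes - mean) / std) @ loadings


# Simulated data (assumption): 100 reference and 50 study subjects, 200 variants.
rng = np.random.default_rng(0)
allele_freq = rng.uniform(0.1, 0.9, size=200)
ref = rng.binomial(2, allele_freq, size=(100, 200))
study = rng.binomial(2, allele_freq, size=(50, 200))

mean, std, loadings = fit_reference_loadings(ref, n_components=2)
study_pcs = project_subjects(study, mean, std, loadings)
ref_pcs = project_subjects(ref, mean, std, loadings)
```

Because the loadings are fixed by the reference sample, the study subjects' components do not depend on which sites or genotyping platforms contributed to the combined study data. In this sketch, `study_pcs` would be the covariates passed to an association model.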