Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records.

Author information

Crosslin David R, Tromp Gerard, Burt Amber, Kim Daniel S, Verma Shefali S, Lucas Anastasia M, Bradford Yuki, Crawford Dana C, Armasu Sebastian M, Heit John A, Hayes M Geoffrey, Kuivaniemi Helena, Ritchie Marylyn D, Jarvik Gail P, de Andrade Mariza

Affiliations

Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA; Department of Genome Sciences, University of Washington, Seattle, WA, USA.

The Sigfried and Janet Weis Center for Research, Geisinger Health System, Danville, PA, USA.

Publication information

Front Genet. 2014 Nov 4;5:352. doi: 10.3389/fgene.2014.00352. eCollection 2014.

Abstract

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) is a common way to address the effect of population stratification. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of multi-ethnic cohorts with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach to generating principal components controlled for site and platform bias, in addition to ancestry differences, and has the advantage of fewer covariates and degrees of freedom.
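
The reference-based idea in the abstract can be illustrated with a minimal sketch. It is one common reading of "deriving components from loadings calculated from a reference sample" and is not taken from the paper: loadings are fit on a reference genotype matrix only, and study samples are then projected onto them. The function names, the simulated genotypes, and the choice of per-variant standardization are all assumptions made for illustration.

```python
import numpy as np

def fit_reference_loadings(ref_genotypes, n_components=2):
    """Fit per-variant loadings on the reference sample only (illustrative)."""
    mean = ref_genotypes.mean(axis=0)
    scale = ref_genotypes.std(axis=0)
    scale[scale == 0] = 1.0  # avoid dividing by zero for monomorphic variants
    z = (ref_genotypes - mean) / scale
    # Rows of vt are the principal axes in variant space.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[:n_components].T  # shape: (n_variants, n_components)
    return mean, scale, loadings

def project(genotypes, mean, scale, loadings):
    """Project samples onto the reference loadings, reusing the reference scaling."""
    return ((genotypes - mean) / scale) @ loadings

# Simulated data (an assumption): two populations with different allele frequencies.
rng = np.random.default_rng(0)
n_variants = 200
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_variants), 0.05, 0.95)

def simulate(n_a, n_b):
    """Draw genotypes coded as 0/1/2 allele counts for both populations."""
    pop_a = rng.binomial(2, freq_a, size=(n_a, n_variants))
    pop_b = rng.binomial(2, freq_b, size=(n_b, n_variants))
    return np.vstack([pop_a, pop_b])

ref = simulate(50, 50)    # reference sample used to fit the loadings
study = simulate(20, 20)  # study samples, possibly from another site or platform

mean, scale, loadings = fit_reference_loadings(ref, n_components=2)
ref_pcs = project(ref, mean, scale, loadings)
study_pcs = project(study, mean, scale, loadings)
```

Because the loadings depend only on the reference sample, study samples do not influence the axes themselves, which is why an approach of this kind may be less sensitive to site and platform artifacts in the combined data. The columns of `study_pcs` would then enter an association model as covariates.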

