Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records.

Author information

Crosslin David R, Tromp Gerard, Burt Amber, Kim Daniel S, Verma Shefali S, Lucas Anastasia M, Bradford Yuki, Crawford Dana C, Armasu Sebastian M, Heit John A, Hayes M Geoffrey, Kuivaniemi Helena, Ritchie Marylyn D, Jarvik Gail P, de Andrade Mariza

Affiliations

Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA; Department of Genome Sciences, University of Washington, Seattle, WA, USA.

The Sigfried and Janet Weis Center for Research, Geisinger Health System, Danville, PA, USA.

Publication information

Front Genet. 2014 Nov 4;5:352. doi: 10.3389/fgene.2014.00352. eCollection 2014.

DOI:10.3389/fgene.2014.00352

PMID:25414722

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4220165/

Abstract

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, has the advantage of fewer covariates and degrees of freedom.
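The reference-sample idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes one plausible reading, in which per-variant loadings are estimated by PCA on a reference panel and study subjects are then projected onto those fixed loadings, so that site or platform structure in the study data cannot shape the components. The function names (`fit_reference_loadings`, `project_onto_loadings`, `simulate`), the simulated allele frequencies, and the sample sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_reference_loadings(ref_genotypes, n_components=2):
    """Estimate per-variant means, scales, and PCA loadings from a reference panel.

    ref_genotypes: (n_reference_subjects, n_variants) array of 0/1/2 allele counts.
    """
    g = np.asarray(ref_genotypes, dtype=float)
    mean = g.mean(axis=0)
    scale = g.std(axis=0)
    scale[scale == 0] = 1.0  # guard against monomorphic variants
    z = (g - mean) / scale
    # Rows of vt are the principal axes in variant space (the loadings).
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, scale, vt[:n_components].T


def project_onto_loadings(genotypes, mean, scale, loadings):
    """Standardize with the reference statistics, then apply the reference loadings."""
    z = (np.asarray(genotypes, dtype=float) - mean) / scale
    return z @ loadings


def simulate(rng, freqs, n_subjects):
    """Draw 0/1/2 allele counts for each subject under Hardy-Weinberg proportions."""
    return rng.binomial(2, freqs, size=(n_subjects, freqs.size))


# Simulated data (an assumption for the demo): two populations whose allele
# frequencies differ by 0.3 at every one of 300 variants.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.05, 0.6, size=300)
freq_b = freq_a + 0.3

reference = np.vstack([simulate(rng, freq_a, 50), simulate(rng, freq_b, 50)])
study_a = simulate(rng, freq_a, 20)
study_b = simulate(rng, freq_b, 20)

# Loadings come from the reference panel only; study subjects never enter the PCA.
mean, scale, loadings = fit_reference_loadings(reference, n_components=2)
pcs_a = project_onto_loadings(study_a, mean, scale, loadings)
pcs_b = project_onto_loadings(study_b, mean, scale, loadings)

print("mean PC1, study population A:", pcs_a[:, 0].mean())
print("mean PC1, study population B:", pcs_b[:, 0].mean())
```

In this toy setting the projected scores separate the two simulated populations along the first component, and they could then be used as covariates in an association model. With real data, the variant set would first need the kind of filtering the abstract mentions, for example excluding regions with inversions or copy number variation.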
