Department of Biostatistics and Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, MI, USA.
Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria.
Am J Hum Genet. 2022 Sep 1;109(9):1653-1666. doi: 10.1016/j.ajhg.2022.07.012. Epub 2022 Aug 17.
Understanding the genetic basis of human diseases and traits is dependent on the identification and accurate genotyping of genetic variants. Deep whole-genome sequencing (WGS), the gold standard technology for SNP and indel identification and genotyping, remains very expensive for most large studies. Here, we quantify the extent to which array genotyping followed by genotype imputation can approximate WGS in studies of individuals of African, Hispanic/Latino, and European ancestry in the US and of Finnish ancestry in Finland (a population isolate). For each study, we performed genotype imputation by using the genetic variants present on the Illumina Core, OmniExpress, MEGA, and Omni 2.5M arrays with the 1000G, HRC, and TOPMed imputation reference panels. Using the Omni 2.5M array and the TOPMed panel, ≥90% of bi-allelic single-nucleotide variants (SNVs) are well imputed (r² > 0.8) down to minor-allele frequencies (MAFs) of 0.14% in African, 0.11% in Hispanic/Latino, 0.35% in European, and 0.85% in Finnish ancestries. There was little difference in TOPMed-based imputation quality among the arrays with >700k variants. Individual-level imputation quality varied widely between and within the three US studies. Imputation quality also varied across genomic regions, producing regions where even common (MAF > 5%) variants were consistently not well imputed across ancestries. The extent to which array genotyping and imputation can approximate WGS therefore depends on reference panel, genotype array, sample ancestry, and genomic location. Imputation quality by variant or genomic region can be queried with our new tool, RsqBrowser, now deployed on the Michigan Imputation Server.
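As an illustration of the "well imputed" criterion, the following is a minimal sketch that assumes imputation quality is measured as the squared Pearson correlation between WGS genotypes and imputed dosages for a single variant. The function names, the 0/1/2 genotype coding, and the toy data are hypothetical and are not taken from the study or from RsqBrowser.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0/1/2 copies of the alternate allele."""
    alt_freq = np.mean(np.asarray(genotypes, dtype=float)) / 2.0
    return float(min(alt_freq, 1.0 - alt_freq))


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns NaN when either vector has no variation, because the
    correlation is undefined in that case.
    """
    g = np.asarray(true_genotypes, dtype=float)
    d = np.asarray(imputed_dosages, dtype=float)
    if g.std() == 0 or d.std() == 0:
        return float("nan")
    r = np.corrcoef(g, d)[0, 1]
    return float(r ** 2)


def is_well_imputed(true_genotypes, imputed_dosages, threshold=0.8):
    """Apply the quality cutoff (0.8 here, assumed to be on the r-squared scale)."""
    return bool(imputation_r2(true_genotypes, imputed_dosages) > threshold)


# Toy data for one variant in eight individuals (hypothetical values).
wgs_genotypes = [0, 0, 1, 2, 0, 1, 0, 0]
imputed_dosages = [0.1, 0.0, 0.9, 1.8, 0.2, 1.1, 0.0, 0.1]

maf = minor_allele_frequency(wgs_genotypes)
r2 = imputation_r2(wgs_genotypes, imputed_dosages)
print(f"MAF = {maf:.3f}, r2 = {r2:.3f}, "
      f"well imputed = {is_well_imputed(wgs_genotypes, imputed_dosages)}")
```

Under this assumed definition, the proportion of well-imputed variants in a MAF bin would be the fraction of variants in that bin for which `is_well_imputed` returns true.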