Cahoon Jordan L, Rui Xinyue, Tang Echo, Simons Christopher, Langie Jalen, Chen Minhui, Lo Ying-Chu, Chiang Charleston W K
Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA.
Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA.
Am J Hum Genet. 2024 May 2;111(5):979-989. doi: 10.1016/j.ajhg.2024.03.011. Epub 2024 Apr 10.
Genotype imputation is now fundamental for genome-wide association studies but lacks fairness due to the underrepresentation of references from non-European ancestries. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative improved the imputation of admixed African-ancestry and Hispanic/Latino samples, but imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we imputed the genotypes of over 43,000 individuals across 123 populations around the world and identified numerous populations where imputation accuracy was markedly lower than that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for variants with minor allele frequencies between 1% and 5% in Saudi Arabians (n = 1,061), Vietnamese (n = 1,264), Thai (n = 2,435), and Papua New Guineans (n = 776) was 0.79, 0.78, 0.76, and 0.62, respectively, compared to 0.90-0.93 for comparable European populations matched in sample size and SNP array content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distance to the European-ancestry reference increased, as predicted. Using sequencing data as ground truth, we also showed that Rsq may overestimate imputation accuracy for non-European populations more than for European populations, suggesting further disparity in accuracy between populations. Using 1,496 sequenced individuals from Taiwan Biobank as a second reference panel alongside TOPMed, we also assessed meta-imputation as a strategy to improve imputation for non-European populations, but this design did not improve accuracy across frequency spectra. Taken together, our analyses suggest that we must ultimately strive to increase the diversity and size of imputation reference panels to promote equity within genetics research.
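The abstract contrasts the imputation software's estimated Rsq with accuracy measured against sequencing data. As a rough illustration of the latter, the sketch below computes the squared Pearson correlation between imputed dosages and sequenced genotypes per variant, then averages it within a minor allele frequency bin. It is not the authors' pipeline: the function names, the input layout (one pair of per-sample lists per variant), the use of sequenced genotypes to compute MAF, and the half-open bin boundaries are all assumptions made for illustration.

```python
from statistics import mean


def squared_correlation(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    Returns None when either vector has zero variance (r^2 is undefined).
    """
    if len(dosages) != len(genotypes) or len(dosages) < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mx, my = mean(dosages), mean(genotypes)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in genotypes)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, genotypes))
    return (sxy * sxy) / (sxx * syy)


def minor_allele_frequency(genotypes):
    """MAF from diploid alternate-allele counts (0, 1, or 2 per sample)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def mean_r2_in_maf_bin(variants, low=0.01, high=0.05):
    """Mean true r^2 over variants whose MAF falls in [low, high).

    `variants` is an iterable of (dosages, genotypes) pairs, one per variant.
    The bin boundaries are an assumption; the abstract only says "between
    1% and 5%". Returns None if no variant in the bin has a defined r^2.
    """
    values = []
    for dosages, genotypes in variants:
        maf = minor_allele_frequency(genotypes)
        if low <= maf < high:
            r2 = squared_correlation(dosages, genotypes)
            if r2 is not None:
                values.append(r2)
    return mean(values) if values else None
```

Comparing this quantity with the software-reported Rsq for the same variants, separately per population, is one way to see whether the estimate is more optimistic in some populations than in others.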