Imputation Accuracy Across Global Human Populations.

Authors

Cahoon Jordan L, Rui Xinyue, Tang Echo, Simons Christopher, Langie Jalen, Chen Minhui, Lo Ying-Chu, Chiang Charleston W K

Affiliations

Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA.

Publication information

bioRxiv. 2023 Oct 26:2023.05.22.541241. doi: 10.1101/2023.05.22.541241.

Abstract

Genotype imputation is now fundamental for genome-wide association studies but lacks fairness due to the underrepresentation of populations with non-European ancestries. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative contains a substantial number of admixed African-ancestry and Hispanic/Latino samples, allowing these populations to be imputed with nearly the same accuracy as European-ancestry cohorts. However, imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we curated genome-wide array data from 23 publications published between 2008 and 2021. In total, we imputed over 43k individuals across 123 populations around the world. We identified a number of populations where imputation accuracy paled in comparison to that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for 1-5% alleles in Saudi Arabians (N=1061), Vietnamese (N=1264), Thai (N=2435), and Papua New Guineans (N=776) was 0.79, 0.78, 0.76, and 0.62, respectively. In contrast, the mean Rsq ranged from 0.90 to 0.93 for comparable European populations matched in sample size and SNP content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distance to the European reference increased, as predicted. Further analysis using sequencing data as ground truth suggested that imputation software may overestimate imputation accuracy more for non-European populations than for European populations, suggesting further disparity between populations. Using 1496 whole-genome-sequenced individuals from the Taiwan Biobank as a reference, we also assessed a strategy to improve imputation for non-European populations with meta-imputation, which can combine results from TOPMed with smaller population-specific reference panels. We found that meta-imputation in this design did not improve Rsq genome-wide. Taken together, our analysis suggests that with the current size of alternative reference panels, meta-imputation alone cannot improve imputation efficacy for underrepresented cohorts, and we must ultimately strive to increase the diversity and size of reference panels to promote equity within genetics research.
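The abstract relies on two quantities: a per-frequency-bin mean of the software-reported Rsq, and an accuracy measured against sequencing data as ground truth. The Python sketch below illustrates one common way to compute each. It is not the authors' pipeline. It assumes that "1-5% alleles" refers to minor allele frequency with inclusive bin edges, and that ground-truth accuracy is the squared Pearson correlation between imputed dosages and sequenced genotypes. The function names and the `(maf, rsq)` input layout are hypothetical.

```python
def squared_correlation(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0..2) and
    sequenced genotypes (0, 1, 2) at one variant (assumed accuracy metric)."""
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")  # undefined when either vector has no variation
    return cov * cov / (var_d * var_g)


def mean_rsq_in_bin(variants, low=0.01, high=0.05):
    """Mean Rsq over (maf, rsq) pairs whose minor allele frequency lies in
    [low, high]; the inclusive 1-5% bin edges are an assumption."""
    selected = [rsq for maf, rsq in variants if low <= maf <= high]
    if not selected:
        raise ValueError("no variants in the requested frequency bin")
    return sum(selected) / len(selected)


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the study.
    imputed = [0.1, 0.9, 1.8, 0.2]
    sequenced = [0, 1, 2, 0]
    print("ground-truth r2:", squared_correlation(imputed, sequenced))

    toy_variants = [(0.02, 0.80), (0.04, 0.70), (0.20, 0.95), (0.005, 0.40)]
    print("mean Rsq, 1-5% bin:", mean_rsq_in_bin(toy_variants))
```

Comparing the software-reported Rsq with the ground-truth value at the same variants is one way to quantify the overestimation described in the abstract.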

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ddb/10602412/d84c395792f9/nihpp-2023.05.22.541241v2-f0001.jpg
