Márquez-Luna Carla, Loh Po-Ru, Price Alkes L
Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America.
Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America.
Genet Epidemiol. 2017 Dec;41(8):811-823. doi: 10.1002/gepi.22083. Epub 2017 Nov 7.
Methods for genetic risk prediction have been widely investigated in recent years. However, most available training data involve European samples, and it is currently unclear how to accurately predict disease risk in other populations. Previous studies have used either large training samples of European ancestry or small training samples from the target population, but not both. Here, we introduce a multiethnic polygenic risk score that combines training data from European samples and training data from the target population. We applied this approach to predict type 2 diabetes (T2D) in a Latino cohort using both publicly available European summary statistics in large sample size (N = 40k) and Latino training data in small sample size (N = 8k). We attained a >70% relative improvement in prediction accuracy (from R² = 0.027 to 0.047) compared to methods that use only one source of training data, consistent with large relative improvements in simulations. We observed a systematically lower load of T2D risk alleles in Latino individuals with more European ancestry, which could be explained by polygenic selection in ancestral European and/or Native American populations. Application to predict T2D in a South Asian UK Biobank cohort using European (N = 40k) and South Asian (N = 16k) training data attained a >70% relative improvement in prediction accuracy, and application to predict height in an African UK Biobank cohort using European (N = 113k) and African (N = 2k) training data attained a 30% relative improvement. Our work reduces the gap in polygenic risk prediction accuracy between European and non-European target populations.
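The abstract does not spell out how the two sources of training data are combined. As a minimal sketch of one plausible reading, the Python example below assumes the multiethnic score is a linear mixture of a European-trained score and a target-population-trained score, with mixing weights fit by least squares in held-out target samples. The function names, the simulated data, and the noise levels are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_mixing_weights(prs_eur, prs_target, phenotype):
    """Least-squares fit of phenotype ~ b0 + a1 * prs_eur + a2 * prs_target.

    Returns the array [b0, a1, a2]. Assumed combination rule, for illustration.
    """
    design = np.column_stack([np.ones_like(prs_eur), prs_eur, prs_target])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coef


def combined_prs(prs_eur, prs_target, coef):
    """Apply previously fitted mixing weights to two single-source scores."""
    return coef[0] + coef[1] * prs_eur + coef[2] * prs_target


def r_squared(prediction, phenotype):
    """Squared Pearson correlation between a prediction and the phenotype."""
    return np.corrcoef(prediction, phenotype)[0, 1] ** 2


# Simulated (hypothetical) data: both scores are noisy views of one liability.
rng = np.random.default_rng(0)
n = 20_000
liability = rng.normal(size=n)
prs_eur = liability + rng.normal(scale=1.0, size=n)      # large training set: less noise
prs_target = liability + rng.normal(scale=1.5, size=n)   # small training set: more noise
phenotype = liability + rng.normal(scale=1.0, size=n)

# Fit the mixing weights on one half, evaluate on the other half.
half = n // 2
coef = fit_mixing_weights(prs_eur[:half], prs_target[:half], phenotype[:half])
prs_multi = combined_prs(prs_eur[half:], prs_target[half:], coef)

r2_eur = r_squared(prs_eur[half:], phenotype[half:])
r2_target = r_squared(prs_target[half:], phenotype[half:])
r2_multi = r_squared(prs_multi, phenotype[half:])
relative_gain = (r2_multi - max(r2_eur, r2_target)) / max(r2_eur, r2_target)

print(f"R2 EUR only:    {r2_eur:.3f}")
print(f"R2 target only: {r2_target:.3f}")
print(f"R2 combined:    {r2_multi:.3f} (relative gain {relative_gain:.1%})")
```

The relative improvement is computed the same way as the figure quoted in the abstract: the gain in R² divided by the R² of the better single-source score. Fitting the weights and evaluating accuracy on disjoint samples keeps the comparison from being inflated by overfitting.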