Department of Neurology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA.
Epilepsy Research Centre, Department of Medicine, University of Melbourne, Austin Health, Heidelberg, VIC 3084, Australia; Population Health and Immunity Division, the Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia; Department of Medical Biology, University of Melbourne, Melbourne, VIC 3052, Australia.
Am J Hum Genet. 2024 Sep 5;111(9):1805-1809. doi: 10.1016/j.ajhg.2024.07.014. Epub 2024 Aug 20.
Polygenic risk scores (PRSs) are an important tool for understanding the role of common genetic variants in human disease. Standard best practices recommend that PRSs be analyzed in cohorts that are independent of the genome-wide association study (GWAS) used to derive the scores, with no sample overlap or relatedness between the two cohorts. However, identifying sample overlap and relatedness can be challenging in an era of GWASs performed by large biobanks and international research consortia. Although most genomics researchers are aware of best practices and theoretical concerns about sample overlap and relatedness between GWAS and PRS cohorts, the prevailing assumption is that the risk of bias is small for very large GWASs. Here, we present two real-world examples demonstrating that sample overlap and relatedness are not a minor or theoretical concern but an important potential source of bias in PRS studies. Using a recently developed statistical adjustment tool, we found that excluding overlapping and related samples was as powerful as or more powerful than adjusting for overlap bias. Our goal is to make genomics researchers aware of the magnitude of the risk of bias from sample overlap and relatedness and to highlight the need for mitigation tools, including independent validation cohorts in PRS studies, continued development of statistical adjustment methods, and tools that let researchers test their cohorts for overlap and relatedness with GWAS cohorts without sharing individual-level data.