Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
Department of Biostatistics, University of North Carolina at Chapel Hill Gillings School of Global Public Health, Chapel Hill, NC 27599, USA.
Am J Hum Genet. 2024 Oct 3;111(10):2129-2138. doi: 10.1016/j.ajhg.2024.08.018. Epub 2024 Sep 12.
Large-scale, multi-ethnic whole-genome sequencing (WGS) studies, such as the National Human Genome Research Institute Genome Sequencing Program's Centers for Common Disease Genomics (CCDG), play an important role in increasing diversity in genetic research. Before association analyses are performed, assessing Hardy-Weinberg equilibrium (HWE) is a crucial quality control step for removing low-quality variants and ensuring valid downstream analyses. Diverse WGS studies contain ancestrally heterogeneous samples; however, commonly used HWE methods assume that the samples are homogeneous. Applying these methods directly to the whole dataset can therefore yield statistically invalid results. To account for this heterogeneity, HWE can be tested on subsets of samples that have genetically homogeneous ancestries, and the results can then be aggregated at each variant. To facilitate valid HWE subset testing, we developed a semi-supervised learning approach that predicts homogeneous ancestries from genotype data. This method provides a convenient tool for assessing HWE in the presence of population structure and missing self-reported race and ethnicity in diverse WGS studies. In addition, assessing HWE within homogeneous ancestry groups provides reliable HWE results that directly benefit downstream analyses, including association analyses in WGS studies. We applied our proposed method to the CCDG dataset, predicting homogeneous genetic ancestry groups for 60,545 multi-ethnic WGS samples to assess HWE within each group.
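The motivation for subset testing can be illustrated with a minimal sketch. It is not the authors' implementation: the genotype counts are hypothetical, the 1-df chi-square test stands in for whichever HWE test a study actually uses, and the aggregation rule (a Bonferroni-adjusted minimum p-value across groups) is an assumption chosen only for illustration. Two groups that are each in HWE but differ in allele frequency show a strong apparent departure when pooled, whereas the per-group tests show none.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square HWE test from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic variant: no departure can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: each group matches HWE proportions,
# but allele A has frequency 0.9 in group 1 and 0.1 in group 2.
group_1 = (810, 180, 10)
group_2 = (10, 180, 810)
pooled = tuple(a + b for a, b in zip(group_1, group_2))

# Test within each homogeneous group, then aggregate at the variant.
# Assumed aggregation rule: Bonferroni-adjusted minimum p-value.
group_p_values = [hwe_chisq_pvalue(*g) for g in (group_1, group_2)]
aggregated_p = min(1.0, len(group_p_values) * min(group_p_values))

# Naive test that ignores population structure
pooled_p = hwe_chisq_pvalue(*pooled)

print("per-group p-values:", group_p_values)
print("aggregated p-value:", aggregated_p)
print("pooled p-value:", pooled_p)
```

In the pooled sample, heterozygotes are far rarer than the pooled allele frequency predicts, so a variant with no genotyping problem would be flagged if the test ignored ancestry.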