Chinese Academy of Sciences and Max Planck Society Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, PR China.
Am J Hum Genet. 2009 Dec;85(6):762-74. doi: 10.1016/j.ajhg.2009.10.015.
To date, most genome-wide association studies (GWAS) and studies of fine-scale population structure have been conducted primarily on Europeans. Han Chinese, the largest ethnic group in the world and about 20% of the global human population, are largely underrepresented in such studies. A well-recognized challenge is that population structure can cause spurious associations in GWAS. In this study, we examined population substructure in a diverse set of over 1700 Han Chinese samples collected from 26 regions across China, each genotyped at approximately 160K single-nucleotide polymorphisms (SNPs). Our results showed that the Han Chinese population is intricately substructured, with the main observed clusters corresponding roughly to northern Han, central Han, and southern Han. Simulated case-control studies showed that genetic differentiation among these clusters, although very small (F_ST = 0.0002 to 0.0009), is sufficient to inflate the rate of false-positive results even when the sample size is moderate. The two SNPs with the greatest frequency differences between the northern Han and southern Han clusters (F_ST > 0.06) were found in the FADS2 gene, which is associated with the fatty acid composition of phospholipids, and in the HLA complex P5 gene (HCP5), which is associated with HIV infection, psoriasis, and psoriatic arthritis. Ingenuity Pathway Analysis (IPA) showed that the most differentiated genes among clusters are involved in cardiac arteriopathy (p < 10^-101). These signals of significant differences among Han Chinese subpopulations should be interpreted carefully if they are also detected in association studies, especially when sample sources are diverse.
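To give a sense of how small the reported F_ST values are, the following is a minimal sketch of Wright's F_ST for a single biallelic SNP in two subpopulations, computed from heterozygosities as (H_T - H_S) / H_T. It assumes equally weighted subpopulations and known allele frequencies; the abstract does not state which F_ST estimator the authors used, and the function name `fst_two_populations` is hypothetical.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's F_ST for one biallelic SNP in two equally weighted subpopulations.

    p1 and p2 are the frequencies of the same allele in each subpopulation.
    This is an illustrative definition, not necessarily the paper's estimator.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")

    # Expected heterozygosity of the pooled population (H_T).
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # The SNP is monomorphic overall, so there is no differentiation to measure.
        return 0.0

    # Mean expected heterozygosity within subpopulations (H_S).
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0

    return (h_total - h_within) / h_total
```

Under these assumptions, allele frequencies of 0.485 and 0.515 give an F_ST of about 0.0009, the upper end of the range reported among the Han clusters. A frequency difference of only three percentage points at a common SNP is therefore already at that level of differentiation.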