Sengupta Dhriti, Choudhury Ananyo, Basu Analabha, Ramsay Michèle
Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
National Institute of Biomedical Genomics, Kalyani, India
Genome Biol Evol. 2016 Dec 31;8(11):3460-3470. doi: 10.1093/gbe/evw244.
Genomic variation in Indian populations is of great interest due to the diversity of ancestral components, social stratification, endogamy and complex admixture patterns. With an expanding population of 1.2 billion, India is also a treasure trove for cataloguing innocuous as well as clinically relevant rare mutations. Recent studies have revealed four dominant ancestries in populations from mainland India: Ancestral North-Indian (ANI), Ancestral South-Indian (ASI), Ancestral Tibeto-Burman (ATB) and Ancestral Austro-Asiatic (AAA). The 1000 Genomes Project (KGP) Phase-3 data include about 500 genomes from five linguistically defined Indian-Subcontinent (IS) populations (Punjabi, Gujarati, Bengali, Telugu and Tamil), some of whom are recent migrants to the USA or UK. Comparative analyses show that, despite the distinct geographic origins of the KGP-IS populations, the ANI component is predominantly represented in this dataset. Previous studies demonstrated population substructure in the HapMap Gujarati population, and we found evidence for additional substructure in the Punjabi and Telugu populations. These substructured populations have characteristic/significant differences in heterozygosity and inbreeding coefficients. Moreover, we demonstrate that the substructure is better explained by factors such as differences in the proportion of ancestral components and endogamy-driven social structure than by invoking a novel ancestral component. Therefore, using language and/or geography as a proxy for an ethnic unit is inadequate for many of the IS populations. This highlights the necessity for more nuanced sampling strategies or corrective statistical approaches, particularly for biomedical and population genetics research in India.