Number of SNPS loci needed to detect population structure.

Author information

Turakulov Rust, Easteal Simon

Affiliation

John Curtin School of Medical Research, Human Genetics Group and Centre for Bioinformatics Science, Australian National University, Canberra, ACT, Australia.

Publication information

Hum Hered. 2003;55(1):37-45. doi: 10.1159/000071808.

DOI:10.1159/000071808

PMID:12890924

Abstract

The study of the association of polymorphic genetic markers with common diseases is one of the most powerful tools in modern genetics. Interest in single nucleotide polymorphisms (SNPs) has grown steadily over the last decade. SNPs are currently the most developed markers in the human genome because they have a number of advantages over other marker types. One of the critical problems responsible for 'spurious' association findings in case-control studies is population stratification. Many statistical approaches have been developed for detecting population heterogeneity. However, the power of known methods to detect population structure is highly dependent on the number of loci used. We analysed SNP data available in the public domain from The Single Nucleotide Consortia Ltd. (TSCL). Three populations, Afro-American, Asian and Caucasian, were compared. We estimated the minimum number of SNP loci necessary to detect the population structure and compared two clustering approaches, distance-based and model-based. The model-based approach was superior to the distance-based method. We found that more than 65 random SNP loci are required to identify distinct, geographically separated populations. Increasing the number of markers to over 100 raises the probability of correctly assigning a particular individual to an origin group to over 90%, even with conventional clustering methods.
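The dependence of assignment accuracy on the number of loci can be illustrated with a small simulation. The sketch below is not the authors' method or data: it assumes three hypothetical populations whose allele frequencies scatter around a shared per-locus value (the `drift_sd` parameter is an arbitrary choice), and it assigns each simulated individual to the population with the highest genotype likelihood under the true, known frequencies. That supervised setting is easier than the unsupervised clustering compared in the abstract, so the accuracies it prints should not be read as a reproduction of the reported thresholds. The function name and all parameters are illustrative assumptions.

```python
import math
import random


def simulate_assignment_accuracy(n_loci, n_per_pop=50, n_pops=3,
                                 drift_sd=0.1, seed=0):
    """Fraction of simulated individuals assigned to their true population.

    Assumptions (not taken from the paper): biallelic, independent loci;
    population allele frequencies are a shared base frequency plus
    Gaussian noise; assignment uses the true frequencies.
    """
    rng = random.Random(seed)

    # Draw one allele frequency per locus and population.
    freqs = [[] for _ in range(n_pops)]
    for _ in range(n_loci):
        base = rng.uniform(0.1, 0.9)
        for pop in range(n_pops):
            p = base + rng.gauss(0.0, drift_sd)
            # Clamp away from 0 and 1 so the logarithms stay finite.
            freqs[pop].append(min(0.99, max(0.01, p)))

    def log_likelihood(genotype, pop_freqs):
        # Genotype g is the number of copies (0, 1 or 2) of the reference
        # allele. The binomial coefficient is omitted because it is the
        # same for every population and does not affect the argmax.
        return sum(g * math.log(p) + (2 - g) * math.log(1.0 - p)
                   for g, p in zip(genotype, pop_freqs))

    correct = 0
    for true_pop in range(n_pops):
        for _ in range(n_per_pop):
            # Two independent allele draws per locus (diploid individual).
            genotype = [(rng.random() < p) + (rng.random() < p)
                        for p in freqs[true_pop]]
            assigned = max(range(n_pops),
                           key=lambda k: log_likelihood(genotype, freqs[k]))
            correct += assigned == true_pop
    return correct / (n_pops * n_per_pop)


if __name__ == "__main__":
    for n in (5, 20, 65, 100, 200):
        print(n, simulate_assignment_accuracy(n))
```

Running the script prints the simulated accuracy for several marker counts; under these assumptions accuracy should rise as loci are added, which is the qualitative pattern the abstract describes.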
