Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA.
Bioinformatics. 2010 Mar 15;26(6):798-806. doi: 10.1093/bioinformatics/btq025. Epub 2010 Jan 22.
The rapid development of genotyping technology and extensive cataloguing of single nucleotide polymorphisms (SNPs) across the human genome have made genetic association studies the mainstream approach for gene mapping of complex human diseases. For many diseases, the most practical approach is the population-based design with unrelated individuals. Although this design offers easier sample collection and greater power than family-based designs, unrecognized population stratification in the study samples can lead to both false-positive and false-negative findings and, if not appropriately corrected, might obscure true association signals.
We report PHYLOSTRAT, a new method that corrects for population stratification by combining a phylogeny constructed from SNP genotypes with principal coordinates from multi-dimensional scaling (MDS) analysis. This hybrid approach efficiently captures both discrete and admixed population structures.
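To illustrate the MDS component only, the following is a minimal sketch of classical (Torgerson) MDS applied to a genotype matrix. It is not the PHYLOSTRAT implementation: the function names `classical_mds` and `genotype_distance`, the mean absolute allele-count distance, and the toy two-population data are all assumptions made for this example, and the phylogeny step is not shown.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: principal coordinates from a distance matrix."""
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues that arise from non-Euclidean distances.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

def genotype_distance(genotypes):
    """Mean absolute allele-count difference per pair (an assumed distance)."""
    g = np.asarray(genotypes, dtype=float)
    return np.abs(g[:, None, :] - g[None, :, :]).mean(axis=2)

# Toy data (hypothetical): two subpopulations with different allele frequencies,
# genotypes coded as 0/1/2 copies of the minor allele at 200 SNPs.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.3, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, 200))
pop_b = rng.binomial(2, freq_b, size=(20, 200))
genotypes = np.vstack([pop_a, pop_b])

# One row of principal coordinates per individual.
coords = classical_mds(genotype_distance(genotypes), n_components=2)
```

In a stratification-corrected association test, coordinates like `coords` would typically enter the regression model as covariates alongside the SNP being tested.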
Through extensive simulations, the analysis of a synthetic genome-wide association dataset created using data from the Human Genome Diversity Project, and the analysis of a lactase-height dataset, we show that our method corrects for population stratification more efficiently than several existing correction methods, including EIGENSTRAT, a hybrid approach based on MDS and clustering, and STRATSCORE, in that it requires fewer random SNPs to infer population structure. By combining the flexibility and hierarchical nature of phylogenetic trees with the ability of MDS to represent admixture, our hybrid approach can effectively capture the complex structures found in human populations.
Code can be downloaded from http://people.pcbi.upenn.edu/~lswang/phylostrat/
mingyao@upenn.edu; iswang@upenn.edu.
Supplementary data are available at Bioinformatics online.