Ancestral informative marker selection and population structure visualization using sparse Laplacian eigenfunctions.

Affiliation

Department of Radiology, The University of Chicago, Chicago, Illinois, United States of America.

Publication information

PLoS One. 2010 Nov 4;5(11):e13734. doi: 10.1371/journal.pone.0013734.

DOI:10.1371/journal.pone.0013734

PMID:21079796

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2973949/

Abstract

Identifying a small panel of markers that are informative about population structure can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics, and evolutionary theory in population genetics. Traditional methods for ascertaining ancestry informative markers usually require prior knowledge of individual ancestry and have difficulty with admixed populations. Recently, Principal Components Analysis (PCA) has been employed successfully to select SNPs that are highly correlated with the top significant principal components (PCs), without using individual ancestry information. That approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the a priori sparseness of informative markers in the genome. Using a simulated ring population and the real global population sample HGDP, with 650K SNPs genotyped in 940 unrelated individuals, we validate that the proposed algorithm selects the most informative markers, a small fraction of which can efficiently recover a similar underlying population structure. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on the HGDP dataset of seven continents, we demonstrate that the SNPs selected by our method are more informative and less redundant than those selected by PCA. Our algorithm is a promising tool in genome-wide association studies and population genetics, facilitating the selection of structure informative markers, the efficient detection of population substructure, and ancestry inference.
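
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a fully connected Gaussian-weighted graph over individuals, the symmetric normalized graph Laplacian, and an L1-penalized (lasso) regression of the first non-trivial eigenvector on standardized genotypes, solved with a plain proximal-gradient loop. The simulated data, the bandwidth heuristic, the penalty level, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 2 populations, 300 SNPs, the first 30 informative.
n_per_pop, n_snps, n_informative = 60, 300, 30
n = 2 * n_per_pop
labels = np.repeat([0, 1], n_per_pop)
freq = np.full((n, n_snps), 0.5)            # allele frequency per individual and SNP
freq[labels == 0, :n_informative] = 0.1
freq[labels == 1, :n_informative] = 0.9
G = rng.binomial(2, freq).astype(float)     # genotypes coded 0/1/2


def standardize(M):
    """Center each column and scale it to unit variance (constant columns stay 0)."""
    M = M - M.mean(axis=0)
    sd = M.std(axis=0)
    return M / np.where(sd > 0, sd, 1.0)


def laplacian_eigenvectors(X, n_vecs=1):
    """Non-trivial eigenvectors of the symmetric normalized graph Laplacian."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    off_diag = ~np.eye(len(X), dtype=bool)
    sigma2 = np.median(d2[off_diag])         # bandwidth heuristic (assumption)
    W = np.exp(-d2 / sigma2) * off_diag      # Gaussian weights, no self-loops
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    return vecs[:, 1:n_vecs + 1]             # skip the trivial first eigenvector


def lasso_ista(A, y, lam, n_iter=500):
    """Minimize (0.5 / n) * ||y - A b||^2 + lam * ||b||_1 by proximal gradient."""
    n_rows, n_cols = A.shape
    step = n_rows / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant
    b = np.zeros(n_cols)
    for _ in range(n_iter):
        z = b - step * (A.T @ (A @ b - y)) / n_rows
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b


A = standardize(G)
y = standardize(laplacian_eigenvectors(G)[:, 0])  # structure summary to explain
lam = 0.5 * np.max(np.abs(A.T @ y)) / n           # half the smallest all-zero penalty
beta = lasso_ista(A, y, lam)
selected = np.flatnonzero(beta)                   # SNPs with non-zero coefficients

# How well the sign of the eigenvector separates the two simulated populations.
split_accuracy = max(np.mean((y > 0) == labels), np.mean((y < 0) == labels))
print("selected SNP indices:", selected)
print("eigenvector split accuracy:", split_accuracy)
```

In this toy setting the non-zero coefficients should concentrate on the simulated informative SNPs (indices below `n_informative`). The penalty `lam` controls the panel size: larger values give fewer markers. A real analysis would also need to handle missing genotypes, linkage disequilibrium, and the choice of how many eigenvectors to regress on, none of which is addressed here.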
