PCA-based population structure inference with generic clustering algorithms.

Author information

Lee Chih, Abdool Ali, Huang Chun-Hsi

Affiliation

Computer Science and Engineering Department, University of Connecticut, Storrs, CT 06269, USA.

Publication information

BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S73. doi: 10.1186/1471-2105-10-S1-S73.

DOI:10.1186/1471-2105-10-S1-S73

PMID:19208178

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2648762/

Abstract

BACKGROUND

Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. We therefore propose applying PCA to the genotype data of a population, selecting the significant principal components using the Tracy-Widom distribution, and assigning the individuals to one or more subpopulations using generic clustering algorithms.
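The pipeline described above (normalise genotypes, project onto leading principal components, cluster the projections) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' implementation: the per-locus normalisation by allele frequency follows a common eigenanalysis convention, the number of retained components is passed in by hand via the hypothetical `n_components` parameter instead of being chosen with a Tracy-Widom test, and the function names `pca_scores` and `kmeans` and the toy data are invented for illustration.

```python
import numpy as np

def pca_scores(genotypes, n_components):
    """Project individuals onto the leading principal components.

    genotypes: (individuals x loci) array of allele counts in {0, 1, 2}.
    n_components is supplied by the caller here; the abstract selects it
    with a Tracy-Widom significance test, which this sketch does not do.
    """
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                     # estimated allele frequency per locus
    keep = (p > 0.0) & (p < 1.0)                 # monomorphic loci carry no signal
    G, p = G[:, keep], p[keep]
    X = (G - 2.0 * p) / np.sqrt(p * (1.0 - p))   # centre and scale each locus (assumed convention)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm) on the PCA scores."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centre, shape (n, k).
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centre; an empty cluster keeps its previous centre.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two subpopulations with clearly different allele frequencies.
rng = np.random.default_rng(1)
n_loci = 200
freq_a = rng.uniform(0.1, 0.4, n_loci)
freq_b = rng.uniform(0.6, 0.9, n_loci)
pop_a = rng.binomial(2, freq_a, size=(20, n_loci))
pop_b = rng.binomial(2, freq_b, size=(20, n_loci))
genotypes = np.vstack([pop_a, pop_b])

scores = pca_scores(genotypes, n_components=1)
labels, _ = kmeans(scores, k=2)
print(labels)
```

Clustering in the reduced space is what keeps the cost low: K-means runs on a handful of coordinates per individual rather than on hundreds of thousands of loci.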

RESULTS

We investigated K-means, soft K-means and spectral clustering and compared them with STRUCTURE, a model-based algorithm specifically designed for population structure inference. We also investigated methods for predicting the number of subpopulations in a population. The results on four simulated datasets and two real datasets indicate that our approach performs comparably to STRUCTURE. For the simulated datasets, STRUCTURE and soft K-means with BIC produced identical predictions of the number of subpopulations. We also showed that, for real datasets, BIC is a better index than likelihood for predicting the number of subpopulations.
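To make the idea of choosing the number of subpopulations with BIC concrete, here is a hedged Python sketch. It assumes that soft K-means can be read as EM for a spherical Gaussian mixture with one shared variance, which is a common interpretation but not necessarily the exact variant used in the paper. The names `farthest_first`, `soft_kmeans` and `bic_score`, the parameter count, and the synthetic 2-D points standing in for PCA scores are all illustrative assumptions.

```python
import numpy as np

def sq_dists(points, centers):
    """Squared Euclidean distances, shape (n_points, n_centers)."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def farthest_first(points, k):
    """Deterministic start: point 0, then repeatedly the point farthest from the chosen ones."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    return np.array(centers)

def soft_kmeans(points, k, n_iter=200, tol=1e-8):
    """EM for a spherical Gaussian mixture with a shared variance (assumed reading of soft K-means).

    Returns the responsibilities and the log-likelihood from the final E-step.
    """
    n, dim = points.shape
    centers = farthest_first(points, k)
    weights = np.full(k, 1.0 / k)
    var = points.var(axis=0).mean()
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: soft assignment of every point to every cluster, in log space.
        log_p = (np.log(weights) - 0.5 * dim * np.log(2.0 * np.pi * var)
                 - sq_dists(points, centers) / (2.0 * var))
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        resp = np.exp(log_p - log_norm)
        loglik = log_norm.sum()
        if loglik - prev < tol:
            break
        prev = loglik
        # M-step: update mixing weights, centres and the shared variance.
        nk = resp.sum(axis=0) + 1e-12            # small floor avoids log(0) and 0-division
        weights = nk / nk.sum()
        centers = (resp.T @ points) / nk[:, None]
        var = (resp * sq_dists(points, centers)).sum() / (n * dim)
    return resp, loglik

def bic_score(loglik, k, n, dim):
    """BIC = -2 log L + (number of free parameters) * ln(n); lower is better."""
    n_params = k * dim + (k - 1) + 1             # centres + mixing weights + shared variance
    return -2.0 * loglik + n_params * np.log(n)

# Synthetic stand-in for PCA scores (hypothetical): three well-separated groups of 50 points.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in true_centers])

results = {k: soft_kmeans(points, k) for k in range(1, 6)}
bic = {k: bic_score(results[k][1], k, *points.shape) for k in results}
best_k = min(bic, key=bic.get)
labels3 = results[3][0].argmax(axis=1)           # hard labels from the K = 3 fit
print(best_k, bic)
```

Raw likelihood never decreases at the optimum as K grows, so it tends to favour too many clusters; the `ln(n)` penalty in BIC is what lets the criterion stop near the number of groups the data actually support.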

CONCLUSION

Our approach is fast and scalable, whereas STRUCTURE is very time-consuming because it relies on MCMC for parameter estimation. We therefore suggest choosing the algorithm according to the intended application of population structure inference.
