PCA-based population structure inference with generic clustering algorithms.

Authors

Lee Chih, Abdool Ali, Huang Chun-Hsi

Affiliation

Computer Science and Engineering Department, University of Connecticut, Storrs, CT 06269, USA.

Publication

BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S73. doi: 10.1186/1471-2105-10-S1-S73.

DOI:10.1186/1471-2105-10-S1-S73
PMID:19208178
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2648762/
Abstract

BACKGROUND

Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. We therefore propose applying PCA to the genotype data of a population, selecting the significant principal components using the Tracy-Widom distribution, and assigning the individuals to one or more subpopulations using generic clustering algorithms.
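
The pipeline described above (normalise genotypes, project onto principal components, cluster the projections) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`simulate`, `normalize`, `top_pc`, `two_means_1d`) and the toy allele frequencies are hypothetical, the per-SNP scaling by sqrt(p(1-p)) is one common choice, and the Tracy-Widom significance test is not implemented, so the sketch simply keeps the first principal component.

```python
import math
import random


def simulate(n_per_pop=10, n_snps=200, seed=0):
    """Toy genotypes (0/1/2 allele counts) for two hypothetical subpopulations."""
    rng = random.Random(seed)
    rows = []
    for freq in (0.2, 0.8):  # assumed allele frequency in each subpopulation
        for _ in range(n_per_pop):
            rows.append([(rng.random() < freq) + (rng.random() < freq)
                         for _ in range(n_snps)])
    return rows


def normalize(genotypes):
    """Centre each SNP column and scale it by sqrt(p * (1 - p))."""
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2.0  # estimated allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # skip monomorphic SNPs, which carry no signal
        scale = math.sqrt(p * (1.0 - p))
        cols.append([(g - mean) / scale for g in col])
    return [[c[i] for c in cols] for i in range(n)]


def top_pc(X, iters=200):
    """Leading eigenvector of X X^T by power iteration (one entry per individual)."""
    n = len(X)
    S = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)]
         for i in range(n)]
    v = [1.0 / (i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        w = [sum(S[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def two_means_1d(scores, iters=50):
    """Plain K-means with K = 2 on one-dimensional PC scores."""
    c0, c1 = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - c0) <= abs(s - c1) else 1 for s in scores]
        g0 = [s for s, lab in zip(scores, labels) if lab == 0]
        g1 = [s for s, lab in zip(scores, labels) if lab == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return labels


genotypes = simulate()
pc1 = top_pc(normalize(genotypes))
labels = two_means_1d(pc1)
```

On this strongly differentiated toy data the first component alone separates the two groups; real data would typically need several components and a principled rule, such as the Tracy-Widom test named above, for choosing how many to keep.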

RESULTS

We investigated K-means, soft K-means and spectral clustering and compared them with STRUCTURE, a model-based algorithm specifically designed for population structure inference. Moreover, we investigated methods for predicting the number of subpopulations in a population. The results on four simulated datasets and two real datasets indicate that our approach performs comparably to STRUCTURE. For the simulated datasets, STRUCTURE and soft K-means with BIC produced identical predictions of the number of subpopulations. We also showed that, for the real datasets, BIC is a better index than likelihood for predicting the number of subpopulations.
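
To make the role of BIC in choosing the number of subpopulations concrete, here is a small sketch. It rests on assumptions that are not taken from the paper: a one-dimensional Gaussian mixture with a shared variance, fitted by EM, stands in for soft K-means; the scores are synthetic; and the names `fit_mixture_1d` and `bic` are hypothetical. BIC is computed in its standard form, -2 log L + (number of parameters) log n, with lower values preferred.

```python
import math
import random


def fit_mixture_1d(xs, k, iters=200):
    """EM for a 1-D Gaussian mixture with k components and one shared variance.

    Returns the log-likelihood evaluated in the final E-step.
    """
    n = len(xs)
    srt = sorted(xs)
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]  # quantile init
    weights = [1.0 / k] * k
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    loglik = 0.0
    for _ in range(iters):
        # E-step: soft assignments (responsibilities) via log-sum-exp
        resp = []
        loglik = 0.0
        for x in xs:
            logs = [math.log(weights[j])
                    - 0.5 * math.log(2.0 * math.pi * var)
                    - (x - means[j]) ** 2 / (2.0 * var)
                    for j in range(k)]
            top = max(logs)
            lse = top + math.log(sum(math.exp(v - top) for v in logs))
            loglik += lse
            resp.append([math.exp(v - lse) for v in logs])
        # M-step: update weights, means and the shared variance
        nk = [max(sum(r[j] for r in resp), 1e-12) for j in range(k)]
        weights = [c / n for c in nk]
        means = [sum(r[j] * x for r, x in zip(resp, xs)) / nk[j]
                 for j in range(k)]
        var = max(sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, xs) for j in range(k)) / n, 1e-6)
    return loglik


def bic(xs, k):
    """BIC = -2 log L + p log n, with p = k means + (k - 1) weights + 1 variance."""
    n_params = 2 * k
    return -2.0 * fit_mixture_1d(xs, k) + n_params * math.log(len(xs))


# Synthetic PC scores for two well-separated hypothetical subpopulations.
rng = random.Random(1)
scores = ([rng.gauss(-3.0, 0.5) for _ in range(30)]
          + [rng.gauss(3.0, 0.5) for _ in range(30)])
bics = {k: bic(scores, k) for k in range(1, 5)}
best_k = min(bics, key=bics.get)  # candidate number of subpopulations
```

Because the maximised likelihood of nested mixtures generally does not decrease as components are added, using it alone tends to favour larger K; the log n penalty in BIC counteracts that, which is one plausible reading of why BIC served as the better index here.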

CONCLUSION

Our approach has the advantage of being fast and scalable, whereas STRUCTURE is very time-consuming because it relies on MCMC for parameter estimation. Therefore, we suggest choosing the algorithm according to the intended application of population structure inference.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/6992dad57314/1471-2105-10-S1-S73-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/160ffd92a621/1471-2105-10-S1-S73-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/bed1d3e01414/1471-2105-10-S1-S73-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/004ab2f3a3d4/1471-2105-10-S1-S73-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/49297da30105/1471-2105-10-S1-S73-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/c004c86a4ae9/1471-2105-10-S1-S73-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/50ef83606c86/1471-2105-10-S1-S73-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/d2c8bc20831b/1471-2105-10-S1-S73-8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044b/2648762/80429982c5fe/1471-2105-10-S1-S73-9.jpg
