
Fast principal component analysis of large-scale genome-wide data.

Affiliation

Medical Systems Biology, Department of Pathology and Department of Microbiology & Immunology, University of Melbourne, Parkville, Victoria, Australia.

Publication information

PLoS One. 2014 Apr 9;9(4):e93766. doi: 10.1371/journal.pone.0093766. eCollection 2014.

DOI: 10.1371/journal.pone.0093766
PMID: 24718290
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3981753/
Abstract

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
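The abstract attributes flashpca's speed to randomized algorithms. As a rough illustration of that general idea, the sketch below implements a generic randomized PCA: a random projection, a few power iterations, and an exact SVD of the resulting small matrix. It is not the flashpca implementation; the function name `randomized_pca`, its parameters (`oversample`, `n_iter`, `seed`), and the simulated two-population genotype matrix are assumptions made for this example.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, n_iter=5, seed=0):
    """Approximate the top-k principal components of X (samples x features)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)               # center each column (here: each SNP)
    n, p = Xc.shape
    ell = min(k + oversample, n, p)       # sketch width, slightly larger than k

    # Random projection: compress the column space of Xc into ell columns.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((p, ell)))

    # Power iterations pull the sketch toward the leading components;
    # re-orthonormalizing at each step keeps the computation stable.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Exact SVD of the small (ell x p) matrix instead of the full data.
    B = Q.T @ Xc
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small

    scores = U[:, :k] * s[:k]             # per-sample principal component scores
    eigvals = s[:k] ** 2 / (n - 1)        # variance captured by each component
    return scores, eigvals, Vt[:k]

# Toy data (hypothetical): two populations of 60 individuals whose allele
# frequencies differ at 300 SNPs; genotypes are minor-allele counts 0/1/2.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=300)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.25, size=300), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(60, 300)),
    rng.binomial(2, freq_b, size=(60, 300)),
]).astype(float)

scores, eigvals, loadings = randomized_pca(geno, k=2)
print("PC1 mean, population A:", scores[:60, 0].mean())
print("PC1 mean, population B:", scores[60:, 0].mean())
```

The expensive decomposition runs on a matrix with only `k + oversample` rows, which is why this family of methods scales to large sample sizes. In this toy setting the first component should separate the two simulated populations.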

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5380/3981753/d978a0d54c91/pone.0093766.g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5380/3981753/56bb907f6ec3/pone.0093766.g002.jpg

