Fast and accurate out-of-core PCA framework for large scale biobank data.

Affiliations

Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 København, Denmark;

Biological and Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, 2100 København, Denmark.

Publication information

Genome Res. 2023 Sep;33(9):1599-1608. doi: 10.1101/gr.277525.122. Epub 2023 Aug 24.

DOI: 10.1101/gr.277525.122
PMID: 37620119
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10620046/
Abstract

Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and for uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm, implemented in PCAone, featuring a window-based optimization scheme that accelerates convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations of the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets from different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly shorter computation time while maintaining accuracy comparable to that of the slower IRAM method. Notably, our analyses of the UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, using <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
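
The abstract does not spell out the algorithm's internals. As a rough illustration of the general idea behind randomized, out-of-core PCA, the sketch below runs plain randomized subspace iteration while reading the data one row block at a time, so the full matrix never has to be held in memory by the routine itself. It is not PCAone's window-based scheme, and the names `blockwise_randomized_pca` and `read_blocks`, as well as the parameters `oversample` and `n_passes`, are hypothetical choices made for this example.

```python
import numpy as np


def blockwise_randomized_pca(read_blocks, n_features, k, oversample=10, n_passes=4, seed=0):
    """Approximate the top-k principal axes of data that is read in row blocks.

    read_blocks: hypothetical callable returning a fresh iterator over row blocks
                 of shape (n_i, n_features); it stands in for re-reading a file.
    """
    rng = np.random.default_rng(seed)
    width = min(k + oversample, n_features)  # oversampled subspace dimension

    # First pass over the data: column means, needed to center each block.
    n_rows = 0
    col_sum = np.zeros(n_features)
    for block in read_blocks():
        col_sum += block.sum(axis=0)
        n_rows += block.shape[0]
    mean = col_sum / n_rows

    # Random orthonormal starting basis in feature space.
    Q, _ = np.linalg.qr(rng.standard_normal((n_features, width)))

    # Each pass applies A^T A to the basis block by block, then re-orthonormalizes.
    for _ in range(n_passes):
        G = np.zeros((n_features, width))
        for block in read_blocks():
            centered = block - mean
            G += centered.T @ (centered @ Q)
        Q, _ = np.linalg.qr(G)

    # Final pass: small projected Gram matrix Q^T A^T A Q, solved exactly.
    T = np.zeros((width, width))
    for block in read_blocks():
        Y = (block - mean) @ Q
        T += Y.T @ Y
    evals, evecs = np.linalg.eigh(T)
    top = np.argsort(evals)[::-1][:k]

    loadings = Q @ evecs[:, top]                           # (n_features, k)
    singular_values = np.sqrt(np.maximum(evals[top], 0.0))
    return loadings, singular_values, mean


# Toy data: a rank-5 signal plus small noise, served in blocks of 64 rows.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 40))
X += 0.01 * rng.standard_normal((300, 40))
read_blocks = lambda: (X[i:i + 64] for i in range(0, X.shape[0], 64))

loadings, singular_values, mean = blockwise_randomized_pca(read_blocks, n_features=40, k=5)
scores = (X - mean) @ loadings  # principal component scores for each row
```

In this sketch every pass touches each block exactly once, so the working memory is governed by the block size and the `n_features x (k + oversample)` basis rather than by the number of rows. More passes generally tighten the approximation at the cost of additional reads of the data.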

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3632/10620046/a7be32c5145b/1599f01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3632/10620046/a5124c8ed001/1599f02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3632/10620046/06fb48c8673a/1599f03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3632/10620046/8295f90d95dd/1599f04.jpg
