Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 København, Denmark;
Biological and Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, 2100 København, Denmark.
Genome Res. 2023 Sep;33(9):1599-1608. doi: 10.1101/gr.277525.122. Epub 2023 Aug 24.
Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that accelerates convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations of the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets from different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly shorter computation times while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of the UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, using <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
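For readers unfamiliar with RSVD, the following is a minimal sketch of the generic randomized SVD with power iterations, written in NumPy. It is not PCAone's window-based algorithm, and it holds the whole matrix in memory rather than working out-of-core. The function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are illustrative assumptions, not part of PCAone. For PCA, the input matrix would first be centered (and typically scaled).

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k SVD of A with a generic randomized scheme.

    Illustrative sketch only: this is the textbook in-memory RSVD,
    not the window-based variant implemented in PCAone.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Sketch size: target rank plus a small oversampling margin.
    size = min(k + oversample, n_rows, n_cols)

    # Project A onto a random subspace to capture its dominant range.
    omega = rng.standard_normal((n_cols, size))
    Q, _ = np.linalg.qr(A @ omega)

    # Power iterations sharpen the subspace; QR keeps it well conditioned.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Exact SVD of the small projected matrix, mapped back to the row space of A.
    B = Q.T @ A
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]
```

Each power iteration requires two further passes over the data, which is why reducing the number of passes needed for convergence matters for data sets that do not fit in memory.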