Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
BMC Bioinformatics. 2020 Oct 14;21(1):453. doi: 10.1186/s12859-020-03796-9.
Bayesian factorization methods, including Coordinated Gene Activity in Pattern Sets (CoGAPS), are emerging as powerful analysis tools for single-cell data. However, these methods have greater computational costs than their gradient-based counterparts, and these costs are often prohibitive for the analysis of large single-cell datasets. Many such methods can be run in parallel, which allows this limitation to be overcome by running on more powerful hardware. However, the constraints imposed by the prior distributions in CoGAPS limit the applicability of parallelization methods to enhance computational efficiency for single-cell analysis.
We developed a new software framework for parallel matrix factorization in Version 3 of the CoGAPS R/Bioconductor package to overcome the computational limitations of Bayesian matrix factorization for single-cell data analysis. This parallelization framework provides asynchronous updates for the sequential updating steps of the algorithm to enhance computational efficiency. These algorithmic advances were coupled with a new software architecture and sparse data structures to reduce the memory overhead for single-cell data.
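To illustrate why sparse data structures reduce memory overhead for single-cell data, the following is a minimal sketch in pure Python of compressed sparse column (CSC) storage. It is not the CoGAPS implementation: the helper names `to_csc` and `csc_get`, the matrix dimensions, and the 5% non-zero density are all assumptions chosen for illustration. The idea is that a count matrix dominated by zeros can be stored as only its non-zero values plus their positions.

```python
import random

def to_csc(dense):
    """Convert a dense matrix (list of rows) into CSC arrays.

    Returns (values, row_idx, col_ptr), where column j occupies
    values[col_ptr[j]:col_ptr[j + 1]]. Hypothetical helper for illustration.
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            v = dense[i][j]
            if v != 0:
                values.append(v)   # keep only non-zero entries
                row_idx.append(i)  # remember which row each entry came from
        col_ptr.append(len(values))
    return values, row_idx, col_ptr

def csc_get(csc, i, j):
    """Look up entry (i, j); anything not stored is an implicit zero."""
    values, row_idx, col_ptr = csc
    for k in range(col_ptr[j], col_ptr[j + 1]):
        if row_idx[k] == i:
            return values[k]
    return 0

# Assumed toy data: 200 genes x 100 cells, roughly 5% non-zero counts.
random.seed(0)
n_genes, n_cells = 200, 100
dense = [
    [random.randint(1, 9) if random.random() < 0.05 else 0
     for _ in range(n_cells)]
    for _ in range(n_genes)
]

csc = to_csc(dense)
dense_slots = n_genes * n_cells               # one slot per matrix entry
sparse_slots = sum(len(a) for a in csc)       # values + row indices + column pointers
print(f"dense slots: {dense_slots}, sparse slots: {sparse_slots}")
```

In this sketch the sparse form needs about two stored numbers per non-zero entry plus one pointer per column, so the saving grows as the fraction of zeros increases.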
Altogether, our new software enhances the efficiency of the CoGAPS Bayesian matrix factorization algorithm so that it can analyze 1000 times more cells, enabling factorization of large single-cell datasets.