C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data.

Affiliations

Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.

Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.

Publication information

BMC Bioinformatics. 2024 Oct 5;25(1):323. doi: 10.1186/s12859-024-05886-4.

DOI:10.1186/s12859-024-05886-4

PMID:39369208

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11456250/

Abstract

In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such data. Furthermore, single-cell data are sparse because of low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools for unraveling complex patterns in multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity of TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for factorizing high-dimensional zero-inflated count data. We also present Consensus-ZIPTF (C-ZIPTF), which combines ZIPTF with a consensus-based approach to address stochasticity. We evaluate the proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, with higher reconstruction accuracy on zero-inflated data; the accuracy gain is largest when the probability of excess zeros is high. Moreover, C-ZIPTF notably improves the consistency of the factorization. On both synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Data and code are available at https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments .
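The abstract names the zero-inflated Poisson model without spelling it out, so the following is only a minimal sketch of the textbook distribution placed on top of a low-rank (CP-style) rate tensor. It is not the authors' implementation, which lives in the scBTF repository. Every function name, the mixing probability `pi`, the factor value range, and the tensor shape are illustrative assumptions. Under this model a zero arises either as a structural zero (probability `pi`) or as an ordinary Poisson zero, which is why a plain Poisson likelihood fits such data poorly.

```python
import itertools
import math
import random


def zip_log_pmf(k, lam, pi):
    """Log-probability of count k under a zero-inflated Poisson.

    Assumes lam > 0 and 0 <= pi < 1.
    """
    if k == 0:
        # A zero is either structural (prob. pi) or a genuine Poisson zero.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Non-zero counts can only come from the Poisson component.
    return math.log1p(-pi) + k * math.log(lam) - lam - math.lgamma(k + 1)


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def cp_rate(factors, index):
    """Poisson rate of one entry: sum over components of products of factor entries."""
    rank = len(factors[0][0])
    return sum(
        math.prod(f[i][r] for f, i in zip(factors, index))
        for r in range(rank)
    )


def simulate_zip_tensor(shape, rank, pi, seed=0):
    """Draw a synthetic zero-inflated count tensor from random positive factors."""
    rng = random.Random(seed)
    # One non-negative factor matrix per tensor mode (hypothetical value range).
    factors = [
        [[rng.uniform(0.1, 1.5) for _ in range(rank)] for _ in range(n)]
        for n in shape
    ]
    tensor = {}
    for index in itertools.product(*(range(n) for n in shape)):
        lam = cp_rate(factors, index)
        is_structural_zero = rng.random() < pi
        tensor[index] = 0 if is_structural_zero else sample_poisson(lam, rng)
    return factors, tensor
```

Calling `simulate_zip_tensor((4, 6, 3), rank=2, pi=0.5)` yields a small three-mode count tensor, stored as a dictionary keyed by index tuples, in which roughly half of the entries are forced to zero regardless of their underlying rate. Summing `zip_log_pmf` over all entries gives the log-likelihood that a zero-inflated factorization would aim to maximize.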
