Pan Yue, Landis Justin T, Moorad Razia, Wu Di, Marron J S, Dittmer Dirk P
Res Sq. 2023 Feb 6:rs.3.rs-2517698. doi: 10.21203/rs.3.rs-2517698/v1.
Modeling single-cell RNA-sequencing (scRNA-seq) data remains challenging because of the high percentage of zeros and the heterogeneity of the data, so improved modeling has strong potential to benefit many downstream analyses. Existing zero-inflated or over-dispersed models rely on aggregation at either the gene level or the cell level, and they typically lose accuracy because aggregation at those two levels is too crude. We avoid the crude approximations entailed by such aggregation by proposing an Independent Poisson Distribution (IPD) at each individual entry of the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. We approach the critical challenge of cell clustering through a novel data representation, Departures from a simple homogeneous IPD (DIPD), which captures the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments, using both real data and crafted experiments, show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that conventional methods either miss or find only after careful parameter tuning. The new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; and (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in an R package available on CRAN.
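As a rough illustration of the idea of measuring departures from a homogeneous independent Poisson model, the following minimal sketch fits one Poisson rate per matrix entry and scores how far each observed count sits from it. Everything specific here is an assumption made for illustration and is not taken from the abstract: the rank-one rate (row total times column total divided by the grand total), the Pearson-type residual used as the departure, the function names, and the toy count matrix. The authors' actual DIPD definition and their R implementation may differ.

```python
import math


def zero_probability(lam):
    """Probability of observing a zero under Poisson(lam): exp(-lam)."""
    return math.exp(-lam)


def homogeneous_poisson_rates(counts):
    """Assumed homogeneous model: lambda_ij = row_total_i * col_total_j / grand_total."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def departures(counts, eps=1e-12):
    """Assumed departure measure: Pearson-type residual (x - lambda) / sqrt(lambda)."""
    rates = homogeneous_poisson_rates(counts)
    return [
        [(x - lam) / math.sqrt(lam + eps) for x, lam in zip(row, rate_row)]
        for row, rate_row in zip(counts, rates)
    ]


# Hypothetical toy matrix: rows are cells, columns are genes.
# The first two cells and the last two cells form two obvious groups.
counts = [
    [5, 0, 0],
    [4, 1, 0],
    [0, 0, 6],
    [0, 1, 5],
]

rates = homogeneous_poisson_rates(counts)
resid = departures(counts)

for rate_row, resid_row in zip(rates, resid):
    print([round(v, 2) for v in rate_row], [round(v, 2) for v in resid_row])

# A very small Poisson parameter makes a zero the overwhelmingly likely outcome.
print(zero_probability(0.01))
```

In this toy example the residuals are positive where a group of cells expresses a gene more than the homogeneous model predicts and negative where it expresses less, which is the kind of per-gene-per-cell structure a downstream clustering step could use.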