Probabilistic count matrix factorization for single cell expression data analysis.

Author information

Univ Lyon, Université Lyon 1, CNRS, LBBE UMR 5558, F Villeurbanne, France.

Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK UMR 5224, F Grenoble, France.

Publication information

Bioinformatics. 2019 Oct 15;35(20):4011-4019. doi: 10.1093/bioinformatics/btz177.

Abstract

MOTIVATION

The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time or level. These emerging patterns are statistically challenging, especially when the goal is a summarized view of single-cell expression data. Principal component analysis (PCA) is a powerful tool for representing high-dimensional data: it searches for the latent directions that capture the most variability in the data. Unfortunately, classical PCA relies on Euclidean distances and projections, which work poorly on over-dispersed count data with dropout events, such as single-cell expression data.

RESULTS

We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm and jointly builds a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization and produces a compression of the data that is very effective for clustering. We compare our method against other standard representation methods such as t-SNE and illustrate its performance for the representation of single-cell expression data.
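To make the idea of a sparse Gamma-Poisson factor model more concrete, the following minimal sketch simulates counts from a generic model of that family. It is an illustration under stated assumptions, not the pCMF implementation and not necessarily the exact specification used in the paper: the function names (`sample_poisson`, `simulate_counts`), the Gamma shape and scale values, and the `keep_prob` sparsity parameter are all hypothetical choices made for this example. Each count is drawn from a Poisson distribution whose rate is the inner product of a non-negative cell factor vector and a sparse, non-negative gene loading vector.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the small rates used in this toy example.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_cells=50, n_genes=20, n_factors=2, keep_prob=0.5, seed=0):
    """Simulate a count matrix from a generic sparse Gamma-Poisson factor model.

    All hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Cell factors U (n_cells x n_factors): non-negative Gamma(shape=2, scale=1) draws.
    U = [[rng.gammavariate(2.0, 1.0) for _ in range(n_factors)]
         for _ in range(n_cells)]
    # Gene loadings V (n_genes x n_factors): a Gamma draw kept with probability
    # keep_prob, otherwise exactly zero. The zeros are what makes V sparse.
    V = [[rng.gammavariate(2.0, 1.0) if rng.random() < keep_prob else 0.0
          for _ in range(n_factors)]
         for _ in range(n_genes)]
    # Counts X (n_cells x n_genes): X[i][j] ~ Poisson(sum_k U[i][k] * V[j][k]).
    X = [[sample_poisson(sum(u * v for u, v in zip(U[i], V[j])), rng)
          for j in range(n_genes)]
         for i in range(n_cells)]
    return U, V, X
```

Calling `simulate_counts()` returns the latent factors together with the simulated count matrix. A gene whose loadings are all zero produces only zero counts, which shows how sparsity in the loadings switches genes off in the low-dimensional representation. Inference (the variational EM step mentioned above) is not covered by this sketch.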

AVAILABILITY AND IMPLEMENTATION

Our work is implemented in the pCMF R package (https://github.com/gdurif/pCMF).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

