Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
PLoS Comput Biol. 2024 Jul 12;20(7):e1012224. doi: 10.1371/journal.pcbi.1012224. eCollection 2024 Jul.
Single-cell RNA sequencing (scRNA-seq) has become a popular experimental method for studying variation in gene expression within a population of cells. However, obtaining an accurate picture of the diversity of distinct gene expression states present in a given dataset is highly challenging because of the sparsity of scRNA-seq data and its inhomogeneous measurement noise properties. Although a vast number of different methods are applied in the literature for clustering cells into subsets with 'similar' expression profiles, these methods generally lack rigorously specified objectives; involve multiple complex layers of normalization, filtering, feature selection, and dimensionality reduction; employ ad hoc measures of distance or similarity between cells; often ignore the known measurement noise properties of scRNA-seq measurements; and include a large number of tunable parameters. Consequently, it is virtually impossible to assign concrete biophysical meaning to the clusterings that result from these methods. Here we address the following problem: Given the raw unique molecule identifier (UMI) counts of an scRNA-seq dataset, partition the cells into subsets such that the gene expression states of the cells in each subset are statistically indistinguishable, and each subset corresponds to a distinct gene expression state. That is, we aim to partition cells so as to maximally reduce the complexity of the dataset without removing any of its meaningful structure. We show that, given the known measurement noise structure of scRNA-seq data, this problem is mathematically well defined, and we derive its unique solution from first principles. We have implemented this solution in a tool called Cellstates, which operates directly on the raw data and automatically determines the optimal partition and cluster number, with zero tunable parameters. We show that, on synthetic datasets, Cellstates almost perfectly recovers optimal partitions. On real data, Cellstates robustly identifies subtle substructure within groups of cells that are traditionally annotated as a common cell type. Moreover, we show that the diversity of gene expression states that Cellstates identifies depends systematically on the tissue of origin and not on technical features of the experiments, such as the total number of cells and the total UMI count per cell. In addition to the Cellstates tool, we also provide a small toolbox of software to place the identified cellstates into a hierarchical tree of higher-order clusters, to identify the most important differentially expressed genes at each branch of this hierarchy, and to visualize these results.
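The partitioning objective stated above can be illustrated with a small sketch. It assumes that UMI counts in each cell are multinomial samples from that cell's expression state and that each state has a symmetric Dirichlet prior; these modelling choices, the pseudocount `alpha`, and the function names `log_marginal` and `sample_cell` are assumptions made for illustration, not details given in the abstract, and this is not the Cellstates implementation. Under these assumptions, a partition can be scored by its marginal likelihood, and cells whose counts are statistically indistinguishable score higher when pooled into one subset.

```python
import math
import random


def log_marginal(counts, labels, alpha=1.0):
    """Score a partition by its log marginal likelihood (up to a constant).

    counts: list of per-cell lists of UMI counts (cells x genes).
    labels: one cluster id per cell.
    alpha:  symmetric Dirichlet pseudocount per gene (assumed for this sketch).
    Per-cell multinomial coefficients do not depend on the partition and are
    therefore omitted.
    """
    n_genes = len(counts[0])
    pooled = {}
    # Pool the UMI counts of all cells that share a cluster label.
    for cell, label in zip(counts, labels):
        totals = pooled.setdefault(label, [0] * n_genes)
        for g, c in enumerate(cell):
            totals[g] += c
    score = 0.0
    for totals in pooled.values():
        # Dirichlet-multinomial evidence for one cluster.
        score += math.lgamma(alpha * n_genes)
        score -= math.lgamma(alpha * n_genes + sum(totals))
        score += sum(math.lgamma(alpha + c) - math.lgamma(alpha) for c in totals)
    return score


def sample_cell(fractions, depth, rng):
    """Draw `depth` UMIs for one cell from the given gene fractions."""
    counts = [0] * len(fractions)
    for g in rng.choices(range(len(fractions)), weights=fractions, k=depth):
        counts[g] += 1
    return counts


# Toy data: two expression states, five cells each, 200 UMIs per cell.
rng = random.Random(0)
state_a = [0.7, 0.1, 0.1, 0.1]
state_b = [0.1, 0.1, 0.1, 0.7]
cells = [sample_cell(state_a, 200, rng) for _ in range(5)]
cells += [sample_cell(state_b, 200, rng) for _ in range(5)]

true_labels = [0] * 5 + [1] * 5                # one subset per state
merged_labels = [0] * 10                       # distinct states lumped together
oversplit_labels = [0, 0, 0, 2, 2] + [1] * 5   # state A split needlessly

for name, labels in [("true", true_labels),
                     ("merged", merged_labels),
                     ("oversplit", oversplit_labels)]:
    print(name, log_marginal(cells, labels))
```

In this toy setting the two-subset partition would be expected to score above both alternatives: merging the two states discards real structure, while splitting a homogeneous group adds complexity without explaining the counts any better. The sketch only scores given partitions; searching over partitions is a separate problem not addressed here.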