Identifying cell states in single-cell RNA-seq data at statistically maximal resolution.

Affiliations

Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.

Publication information

PLoS Comput Biol. 2024 Jul 12;20(7):e1012224. doi: 10.1371/journal.pcbi.1012224. eCollection 2024 Jul.

DOI:10.1371/journal.pcbi.1012224

PMID:38995959

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11364423/

Abstract

Single-cell RNA sequencing (scRNA-seq) has become a popular experimental method to study variation of gene expression within a population of cells. However, obtaining an accurate picture of the diversity of distinct gene expression states that are present in a given dataset is highly challenging because of the sparsity of the scRNA-seq data and its inhomogeneous measurement noise properties. Although a vast number of different methods are applied in the literature for clustering cells into subsets with 'similar' expression profiles, these methods generally lack rigorously specified objectives, involve multiple complex layers of normalization, filtering, feature selection, and dimensionality reduction, employ ad hoc measures of distance or similarity between cells, often ignore the known measurement noise properties of scRNA-seq measurements, and include a large number of tunable parameters. Consequently, it is virtually impossible to assign concrete biophysical meaning to the clusterings that result from these methods. Here we address the following problem: Given raw unique molecular identifier (UMI) counts of an scRNA-seq dataset, partition the cells into subsets such that the gene expression states of the cells in each subset are statistically indistinguishable, and each subset corresponds to a distinct gene expression state. That is, we aim to partition cells so as to maximally reduce the complexity of the dataset without removing any of its meaningful structure. We show that, given the known measurement noise structure of scRNA-seq data, this problem is mathematically well-defined and derive its unique solution from first principles. We have implemented this solution in a tool called Cellstates, which operates directly on the raw data and automatically determines the optimal partition and cluster number, with zero tunable parameters. We show that, on synthetic datasets, Cellstates almost perfectly recovers optimal partitions. On real data, Cellstates robustly identifies subtle substructure within groups of cells that are traditionally annotated as a common cell type. Moreover, we show that the diversity of gene expression states that Cellstates identifies systematically depends on the tissue of origin and not on technical features of the experiments such as the total number of cells and total UMI count per cell. In addition to the Cellstates tool, we also provide a small toolbox of software to place the identified cell states into a hierarchical tree of higher-order clusters, to identify the most important differentially expressed genes at each branch of this hierarchy, and to visualize these results.
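To make the partitioning objective more concrete, the following is a minimal sketch of how candidate partitions of raw UMI counts could be scored against each other. It is not the Cellstates implementation. It assumes (and the abstract does not spell this out) that each cell's counts are multinomial samples from an expression vector shared by all cells in its cluster, and that this vector has a symmetric Dirichlet prior with a hypothetical parameter `alpha`. The function names and the toy count matrix are illustrative only.

```python
from math import lgamma


def log_beta(alphas):
    """Log of the multivariate Beta function B(alphas)."""
    return sum(lgamma(a) for a in alphas) - lgamma(sum(alphas))


def cluster_log_evidence(cells, alpha=1.0):
    """Log marginal likelihood of one cluster, up to partition-independent terms.

    Assumption: all cells in the cluster share one expression vector, counts
    are multinomial, and the vector has a symmetric Dirichlet(alpha) prior.
    The multinomial coefficients are dropped because they are identical for
    every partition of the same data.
    """
    n_genes = len(cells[0])
    # Pool the UMI counts of all cells in the cluster, gene by gene.
    pooled = [sum(cell[g] for cell in cells) for g in range(n_genes)]
    prior = [alpha] * n_genes
    posterior = [alpha + n for n in pooled]
    return log_beta(posterior) - log_beta(prior)


def partition_log_evidence(counts, labels, alpha=1.0):
    """Sum the cluster evidences over all clusters defined by `labels`."""
    clusters = {}
    for cell, label in zip(counts, labels):
        clusters.setdefault(label, []).append(cell)
    return sum(cluster_log_evidence(c, alpha) for c in clusters.values())


# Toy data (made up for illustration): 4 cells x 3 genes of raw UMI counts.
# Cells 0-1 mostly express gene 0, cells 2-3 mostly express gene 2.
counts = [
    [30, 2, 1],
    [28, 3, 0],
    [1, 2, 33],
    [0, 4, 29],
]

merged = partition_log_evidence(counts, [0, 0, 0, 0])      # one cluster
split = partition_log_evidence(counts, [0, 0, 1, 1])       # two clusters
over_split = partition_log_evidence(counts, [0, 1, 2, 3])  # one cluster per cell

print(f"merged: {merged:.2f}  split: {split:.2f}  over_split: {over_split:.2f}")
```

Under these assumptions the two-cluster partition scores higher than both the single-cluster and the one-cell-per-cluster partitions: merging distinguishable cells is penalized, and so is splitting cells whose counts are consistent with a common state. A practical tool would also need a search strategy over partitions, which this sketch omits.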

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74e/11364423/1381fc2e0d95/pcbi.1012224.g001.jpg
