Statistical significance of cluster membership for unsupervised evaluation of cell identities.

Affiliations

Institute of Informatics, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Warsaw 02-097, Poland.

NHLBI Integrated Cardiovascular Data Science Training Program, University of California, Los Angeles, CA 90095, USA.

Publication information

Bioinformatics. 2020 May 1;36(10):3107-3114. doi: 10.1093/bioinformatics/btaa087.

DOI:10.1093/bioinformatics/btaa087

PMID:32142108

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7214036/

Abstract

MOTIVATION

Single-cell RNA-sequencing (scRNA-seq) allows us to dissect transcriptional heterogeneity arising from cellular types, spatio-temporal contexts and environmental stimuli. Transcriptional heterogeneity may reflect phenotypes and molecular signatures that are often unmeasured or unknown a priori. Cell identities of samples derived from heterogeneous subpopulations are then determined by clustering the scRNA-seq data, and these cell identities are used in downstream analyses. How can we examine whether cell identities are accurately inferred? Unlike external measurements or labels for single cells, clustering-based cell identities can result in spurious signals and false discoveries.

RESULTS

We introduce non-parametric methods to evaluate cell identities by testing cluster memberships in an unsupervised manner. Diverse simulation studies demonstrate the accuracy of the jackstraw test for cluster membership. We propose a posterior probability that a cell should be included in its clustering-based subpopulation. Posterior inclusion probabilities (PIPs) for cluster memberships can be used to select and visualize samples relevant to subpopulations. The proposed methods are applied to three scRNA-seq datasets. First, a mixture of Jurkat and 293T cell lines provides two distinct cellular populations. Second, Cell Hashing yields cell identities corresponding to eight donors, which are independently analyzed by the jackstraw. Third, peripheral blood mononuclear cells are used to explore heterogeneous immune populations. The proposed P-values and PIPs lead to probabilistic feature selection of single cells that can be visualized using principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and others. By learning uncertainty in clustering high-dimensional data, the proposed methods enable unsupervised evaluation of cluster membership.
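The idea of testing cluster membership by resampling can be illustrated with a minimal sketch. It is not the implementation in the jackstraw R package. The tiny k-means, the F-statistic from regressing a cell on its cluster center, the shuffling of features within `s` randomly chosen cells, and the names `jackstraw_cluster_pvalues` and `simulate` are all assumptions made for illustration, and the conversion of P-values into PIPs is not shown.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign(cells, centers):
    """Index of the nearest center for every cell."""
    return [min(range(len(centers)), key=lambda j: sq_dist(c, centers[j]))
            for c in cells]


def kmeans(cells, k, iters=25):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [list(cells[0])]
    while len(centers) < k:
        # Next seed: the cell farthest from its closest existing center.
        far = max(cells, key=lambda c: min(sq_dist(c, z) for z in centers))
        centers.append(list(far))
    for _ in range(iters):
        labels = assign(cells, centers)
        for j in range(k):
            members = [c for c, lab in zip(cells, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(cells, centers)


def f_stat(x, y):
    """F-statistic of a simple linear regression of x on y (monotone in r^2)."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    r2 = min(sxy * sxy / (sxx * syy), 1 - 1e-12)
    return r2 * (m - 2) / (1 - r2)


def jackstraw_cluster_pvalues(cells, k, s=2, B=20, seed=0):
    """Empirical P-value per cell for membership in its assigned cluster."""
    rng = random.Random(seed)
    centers, labels = kmeans(cells, k)
    observed = [f_stat(c, centers[lab]) for c, lab in zip(cells, labels)]
    null = []
    for _ in range(B):
        perturbed = [list(c) for c in cells]
        picked = rng.sample(range(len(cells)), s)
        for i in picked:
            rng.shuffle(perturbed[i])  # break this cell's link to any cluster
        c_star, l_star = kmeans(perturbed, k)
        # Null statistics come only from the synthetic (shuffled) cells.
        null.extend(f_stat(perturbed[i], c_star[l_star[i]]) for i in picked)
    # Share of null statistics at least as large as the observed one.
    return [(1 + sum(f0 >= f for f0 in null)) / (1 + len(null))
            for f in observed]


def simulate(n_per_cluster=20, m=20, noise=0.5, seed=1):
    """Hypothetical toy data: two clusters with opposite feature patterns."""
    rng = random.Random(seed)
    half = m // 2
    patterns = [[2.0] * half + [-2.0] * (m - half),
                [-2.0] * half + [2.0] * (m - half)]
    return [[v + rng.gauss(0, noise) for v in p]
            for p in patterns for _ in range(n_per_cluster)]
```

Calling `jackstraw_cluster_pvalues(simulate(), k=2)` returns one P-value per cell. Because the null distribution is built from `B * s` synthetic cells, the smallest attainable P-value in this sketch is `1 / (1 + B * s)`, so larger `B` gives finer resolution at the cost of more clustering runs.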

AVAILABILITY AND IMPLEMENTATION

https://cran.r-project.org/package=jackstraw.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
