Thai Christopher, Singh Amartya, Herranz Daniel, Khiabanian Hossein
Rutgers Cancer Institute, Rutgers University, New Brunswick, NJ 08901, USA.
Center for Systems and Computational Biology, Rutgers Cancer Institute, Rutgers University, New Brunswick, NJ 08901, USA.
bioRxiv. 2025 May 10:2025.05.06.652497. doi: 10.1101/2025.05.06.652497.
Defining cell types using unsupervised clustering algorithms based on transcriptional similarity is a powerful application of single-cell RNA sequencing. A single clustering resolution may not yield clusters that represent both broad, well-defined populations and smaller subpopulations simultaneously. Therefore, when cell identities are not known prior to sequencing, robust comparison and annotation of inferred clusters remain a challenge. In this work, we define the distance between single-cell clusters by proposing the use of the average overlap metric to compare ranked lists of differentially expressed genes in a top-weighted manner. We first benchmark our approach on a dataset with known ground truth, composed of highly similar yet distinct T-cell populations, and show that evaluating clusters with average overlap results in a consistent, precise, and biologically meaningful recapitulation of true cell identities. We then apply our approach to data from unsorted mouse thymocytes and characterize stages of T-cell development in the thymus, including minor populations of double-negative (CD4-CD8-) T-cells that are notoriously difficult to detect with confidence in unsorted single-cell data. We demonstrate that measuring cluster similarity with average overlap of marker gene rankings enables robust, reproducible characterization of single cells and clarifies biological interpretation of their underlying identities in highly homogeneous populations.
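As an illustration of the top-weighted comparison described above, the following is a minimal sketch of the standard average overlap between two ranked lists: the mean, over depths d = 1..k, of the fraction of items shared by the two top-d prefixes. The function names, the gene lists, and the conversion to a distance as 1 minus the average overlap are assumptions made for illustration; the abstract does not specify the authors' implementation, evaluation depth, or exact distance definition.

```python
from typing import Optional, Sequence


def average_overlap(a: Sequence[str], b: Sequence[str], depth: Optional[int] = None) -> float:
    """Average overlap of two ranked lists (best-ranked item first).

    At each depth d the agreement is |top-d(a) & top-d(b)| / d; the score
    is the mean agreement over d = 1..depth. Agreement near the top of the
    lists enters every later term, which makes the score top-weighted.
    """
    if depth is None:
        depth = min(len(a), len(b))
    if depth < 1 or depth > min(len(a), len(b)):
        raise ValueError("depth must be between 1 and the shorter list length")
    total = 0.0
    for d in range(1, depth + 1):
        shared = len(set(a[:d]) & set(b[:d]))
        total += shared / d
    return total / depth


def ao_distance(a: Sequence[str], b: Sequence[str], depth: Optional[int] = None) -> float:
    """Hypothetical distance: 1 - average overlap (an assumption, not the paper's stated formula)."""
    return 1.0 - average_overlap(a, b, depth)


# Made-up marker rankings for two clusters, for illustration only.
cluster_x = ["Cd4", "Cd8a", "Il7r", "Ccr7"]
cluster_y = ["Cd8a", "Cd4", "Ccr7", "Il7r"]
print(round(average_overlap(cluster_x, cluster_y), 3))
```

In this toy case the two rankings contain the same genes but disagree at rank 1, so the score falls well below 1 even though the full sets are identical; a set-based measure such as the Jaccard index would not register that difference.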