Division of Biostatistics and Department of Statistics, University of California, Berkeley, Berkeley, California, USA.
Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA.
J Comput Biol. 2022 Aug;29(8):867-879. doi: 10.1089/cmb.2021.0652. Epub 2022 Jul 6.
Unsupervised cell clustering on the basis of meaningful biological variation in single-cell RNA sequencing (scRNA-seq) data has received significant attention, as it helps identify ontological subpopulations in the data. A key step in the clustering process is to compute distances between the cells under a specified distance measure. Although particular distance measures may successfully separate cells into biologically relevant clusters, they may fail to retain the global structure of the data, such as the relative similarity between cell clusters. In this article, we modify a biologically motivated distance measure, SIDEseq, for use in aggregate comparisons of cell types in large single-cell assays, and demonstrate that, across simulated and real scRNA-seq data, the resulting distance matrix retains global cell type relationships more consistently than distance measures commonly used for scRNA-seq clustering. We call the modified distance measure "SIDEREF." We explore spectral dimension reduction of the SIDEREF distance matrix as a means of noise filtering, similar to principal components analysis applied directly to expression data. We use a summary measure of relative cell type distances to better display the relationships between cell groups. SIDEREF visualizations reflect global structures in the data more consistently than those of other commonly considered distance measures. We use relative cell type distances and the SIDEREF distance measure to uncover compositional differences between annotated leukocyte cell groups in a compendium of scRNA-seq assays comprising 12 tissues. SIDEREF and the associated analyses are openly available on GitHub.
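The two generic steps named in the abstract, spectral dimension reduction of a cell-by-cell distance matrix and a summary of relative cell type distances, can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not compute SIDEREF: plain Euclidean distance is used as a stand-in distance measure, classical multidimensional scaling is assumed as one possible form of spectral reduction, and the group summary is assumed to be a simple mean of pairwise distances. All function names (`pairwise_euclidean`, `spectral_embed`, `group_mean_distances`, `demo`) are hypothetical.

```python
import numpy as np


def pairwise_euclidean(X):
    """Cell-by-cell Euclidean distances (a stand-in, NOT the SIDEREF measure)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


def spectral_embed(D, k):
    """Classical MDS: embed a distance matrix in its top-k eigen-directions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]          # indices of the k largest
    lam = np.maximum(vals[top], 0.0)          # drop negative eigenvalues
    return vecs[:, top] * np.sqrt(lam)


def group_mean_distances(D, labels):
    """Mean pairwise distance between every pair of annotated cell groups."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    G = np.zeros((len(groups), len(groups)))
    for i, a in enumerate(groups):
        for j, b in enumerate(groups):
            block = D[np.ix_(labels == a, labels == b)]
            if i == j:
                m = block.shape[0]
                # exclude the zero self-distances on the diagonal
                G[i, j] = block.sum() / (m * (m - 1)) if m > 1 else 0.0
            else:
                G[i, j] = block.mean()
    return groups, G


def demo(seed=0):
    """Three simulated groups: groups 0 and 1 are close, group 2 is far away."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1, 2], 20)
    centers = np.array([0.0, 1.0, 10.0])
    X = centers[labels][:, None] + rng.normal(size=(60, 50))
    D = pairwise_euclidean(X)
    Z = spectral_embed(D, k=2)                # low-dimensional, noise-filtered
    D_filtered = pairwise_euclidean(Z)
    return group_mean_distances(D_filtered, labels)


if __name__ == "__main__":
    groups, G = demo()
    print(groups)
    print(np.round(G, 2))
```

In this toy setting, the group-level matrix `G` should place groups 0 and 1 much closer to each other than either is to group 2, which is the kind of global relationship the abstract is concerned with preserving.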