ICTEAM/INGI/Artificial Intelligence and Algorithms Group, UCLouvain, Louvain-la-Neuve 1348, Belgium.
Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae371.
Identifying rare cell types is an important task for capturing the heterogeneity of single-cell data, such as scRNA-seq. The widespread availability of such data makes it possible to aggregate multiple samples, corresponding for example to different donors, into the same study. Yet such aggregated data is often subject to batch effects between samples. Clustering it therefore generally requires data integration methods, which can lead to overcorrection and make rare cells difficult to identify. We present scCross, a biclustering method that identifies rare subpopulations of cells present across multiple single-cell samples. It jointly identifies a group of cells and their specific marker genes by relying on a global sum criterion, computed over an entire subpopulation of cells, rather than on pairwise comparisons between individual cells. This proves robust to the high variability of scRNA-seq data, in particular to batch effects.
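The contrast between a group-level sum and pairwise comparisons can be illustrated with a toy sketch. It is not the scCross objective, which this abstract does not specify. The matrix values, the batch shift, and the helper names `global_sum_score` and `nearest_neighbour` are all assumptions made for illustration, and the code only scores hand-picked candidates rather than searching for a bicluster.

```python
import math

def global_sum_score(expr, cells, genes):
    """Toy group-level criterion: sum of the entries of the chosen submatrix."""
    return sum(expr[c][g] for c in cells for g in genes)

def nearest_neighbour(expr, cell):
    """Toy pairwise criterion: index of the closest other cell (Euclidean)."""
    others = [c for c in range(len(expr)) if c != cell]
    return min(others, key=lambda c: math.dist(expr[cell], expr[c]))

# Hypothetical centred expression values (positive = expressed).
# Rows are cells, columns are genes; genes 0 and 1 mark the rare type.
rare = [2.0, 2.0, -1.0]
common = [-1.0, -1.0, 1.0]
batch_a = [rare, common, common]            # cells 0-2
shift = [-0.5, -0.5, 6.0]                   # assumed batch effect on batch B
batch_b = [[v + s for v, s in zip(row, shift)]
           for row in (rare, common, common)]  # cells 3-5
expr = batch_a + batch_b

markers = [0, 1]
# Pairwise view: the rare cell of batch B (cell 3) is closest to a common
# cell of its own batch, because the shift on gene 2 dominates the distance.
print(nearest_neighbour(expr, 3))
# Group view: the two rare cells score higher on the marker genes than a
# group mixing a rare and a common cell from the same batch.
print(global_sum_score(expr, [0, 3], markers))
print(global_sum_score(expr, [3, 4], markers))
```

In this toy setting the distance-based view pairs the rare cell with its batch mates, while the sum over the candidate group and marker genes still separates the true rare group from a mixed one.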
We show through several case studies that scCross can identify rare subpopulations across multiple samples without prior data integration. In particular, it identifies a cilium subpopulation with potential new ciliary genes from lung cancer cells, which typical alternatives do not detect. It also highlights rare subpopulations in human pancreas samples sequenced with different protocols, despite visible shifts in expression levels between batches. We further show that scCross outperforms typical alternatives at identifying a target rare cell type in a controlled experiment with artificially created batch effects. These results demonstrate the ability of scCross to efficiently identify rare cell subpopulations characterized by specific genes despite the presence of batch effects.
The R and Scala implementation of scCross is freely available on GitHub, at https://github.com/agerniers/scCross/. A snapshot of the code and the data underlying this article are available on Zenodo, at https://zenodo.org/doi/10.5281/zenodo.10471063.