Khoeini Arash, Sar Funda, Lin Yen-Yi, Collins Colin, Ester Martin
School of Computing Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada.
Vancouver Prostate Centre, Vancouver, British Columbia V6H 3Z6, Canada.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf137.
Single-cell RNA sequencing (scRNA-seq) analysis relies heavily on effective clustering to facilitate numerous downstream applications. Although several machine learning methods have been developed to enhance single-cell clustering, most are fully unsupervised and overlook the rich repository of annotated datasets available from previous single-cell experiments. Since cells are inherently high-dimensional entities, unsupervised clustering can often result in clusters that lack biological relevance. Leveraging annotated scRNA-seq datasets as a reference can significantly enhance clustering performance, enabling the identification of biologically meaningful clusters in target datasets.
In this article, we propose Single Cell MUlti-Source CLustering (scMUSCL), a novel transfer learning method designed to identify cell clusters in a target dataset by leveraging knowledge from multiple annotated reference datasets. scMUSCL employs a deep neural network to extract domain- and batch-invariant cell representations, effectively addressing discrepancies across various source datasets and between source and target datasets within the new representation space. Unlike existing methods, scMUSCL does not require prior knowledge of the number of clusters in the target dataset and eliminates the need for batch correction between source and target datasets. We conduct extensive experiments using 20 real-life datasets, demonstrating that scMUSCL consistently outperforms existing unsupervised and transfer learning-based methods. Furthermore, our experiments show that scMUSCL benefits from multiple source datasets as learning references and accurately estimates the number of clusters.
The Python implementation of scMUSCL is available at https://github.com/arashkhoeini/scMUSCL.
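As a rough illustration of the general idea in the abstract (using several annotated source datasets as references for an unlabelled target dataset), the following toy sketch may help. It is not the scMUSCL algorithm. It swaps the deep neural network for per-dataset z-scoring as a crude stand-in for batch-invariant representations, uses nearest-centroid assignment, and does not estimate the number of clusters. All function names and the synthetic data are hypothetical and chosen only for this example.

```python
import numpy as np

def standardize(x):
    """Z-score each gene within one dataset (assumed stand-in for a learned batch-invariant embedding)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def source_centroids(sources):
    """Pool labelled cells from several source datasets and average them per cell type."""
    feats = np.vstack([standardize(x) for x, _ in sources])
    labels = np.concatenate([y for _, y in sources])
    types = np.unique(labels)
    cents = np.vstack([feats[labels == t].mean(axis=0) for t in types])
    return types, cents

def assign_target(target, types, cents):
    """Give each target cell the label of its nearest source centroid (squared Euclidean distance)."""
    z = standardize(target)
    dists = ((z[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return types[dists.argmin(axis=1)]

def make_dataset(rng, n_per_type, shift):
    """Toy data: two cell types separated on genes 0 and 1, plus a dataset-specific offset."""
    a = rng.normal([3.0, 0.0, 0.0], 0.3, size=(n_per_type, 3))
    b = rng.normal([0.0, 3.0, 0.0], 0.3, size=(n_per_type, 3))
    x = np.vstack([a, b]) + shift  # the shift mimics a batch effect
    y = np.array(["A"] * n_per_type + ["B"] * n_per_type)
    return x, y

# Two annotated sources and one target, each with a different offset.
rng = np.random.default_rng(0)
source_1 = make_dataset(rng, 50, shift=5.0)
source_2 = make_dataset(rng, 50, shift=-2.0)
target_x, target_y = make_dataset(rng, 50, shift=10.0)

types, cents = source_centroids([source_1, source_2])
predicted = assign_target(target_x, types, cents)
print("agreement with held-out target labels:", (predicted == target_y).mean())
```

The target labels are used here only to check the assignment. In the setting the abstract describes, the target dataset is unannotated and may contain cell types absent from every source, which this sketch does not handle.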