IEEE Trans Neural Netw Learn Syst. 2024 Aug;35(8):11371-11381. doi: 10.1109/TNNLS.2023.3260003. Epub 2024 Aug 5.
A variety of single-cell RNA-seq (scRNA-seq) clustering methods have achieved great success in discovering cellular phenotypes. However, clustering remains challenging when the data are confounded by batch effects introduced by different experimental conditions or technologies: the resulting data partitions are biased toward these nonbiological factors. Moreover, batch differences are not always much smaller than true biological variation, which hinders the combined use of batch integration and clustering methods. To overcome this challenge, we propose single-cell RNA-seq debiased clustering (SCDC), an end-to-end clustering method that is debiased against batch effects by disentangling the biological and nonbiological information in scRNA-seq data during data partitioning. In six analyses, SCDC qualitatively and quantitatively outperforms both state-of-the-art clustering and batch integration methods in handling scRNA-seq data with batch effects. Furthermore, SCDC's running time increases linearly with the number of cells while its graphics processing unit (GPU) memory consumption stays fixed, making it scalable to large datasets. The code will be released on GitHub.