Clustering single-cell RNA sequencing data via iterative smoothing and self-supervised discriminative embedding.

Affiliations

Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.

Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.

Publication information

Oncogene. 2024 Jul;43(29):2279-2292. doi: 10.1038/s41388-024-03074-5. Epub 2024 Jun 4.

DOI:10.1038/s41388-024-03074-5

PMID:38834657

Abstract

Single-cell transcriptome sequencing (scRNA-seq) is a high-throughput technique for studying gene expression at the single-cell level. Clustering is a common step in scRNA-seq data analysis that helps researchers identify cell types and uncover interactions between cells. However, choosing a robust similarity metric for clustering remains an open challenge because of the complex underlying structure of the data and the noise inherent in data acquisition. Here, we propose scRISE (scRNA-seq Iterative Smoothing and self-supervised discriminative Embedding model), a deep clustering method for scRNA-seq data that addresses this challenge. The model consists of two main modules: an iterative smoothing module based on graph autoencoders, which alternately denoises the data and refines the pairwise similarity so as to gradually incorporate cell structural features and enrich the data; and a self-supervised discriminative embedding module with an adaptive similarity threshold, which partitions samples into clusters. On seventeen scRNA-seq datasets, our approach improved the quality of data representation and clustering relative to a number of state-of-the-art deep learning clustering methods. Furthermore, applying scRISE to the HNSCC (head and neck squamous cell carcinoma) dataset revealed 62 informative genes, highlighting their potential roles as therapeutic targets and biomarkers.
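The alternation between denoising and similarity refinement described above can be illustrated with a minimal sketch. This is not the scRISE implementation: as a loudly stated assumption, the graph autoencoder is replaced here by plain neighbour averaging on a k-nearest-neighbour graph, and the function names (`cosine_similarity`, `knn_graph`, `iterative_smoothing`) and the parameters `k` and `n_iter` are hypothetical choices made for illustration only.

```python
import numpy as np


def cosine_similarity(x):
    """Pairwise cosine similarity between rows (cells) of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-12)  # guard against all-zero cells
    return unit @ unit.T


def knn_graph(sim, k):
    """Row-normalised k-nearest-neighbour graph built from a similarity matrix."""
    n = sim.shape[0]
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)  # a cell is not its own neighbour
    idx = np.argsort(-s, axis=1)[:, :k]  # k most similar cells per row
    adj = np.zeros((n, n))
    adj[np.arange(n)[:, None], idx] = 1.0
    adj = np.maximum(adj, adj.T)  # symmetrise the graph
    adj += np.eye(n)  # self-loop so each cell keeps its own signal
    return adj / adj.sum(axis=1, keepdims=True)


def iterative_smoothing(x, k=5, n_iter=3):
    """Alternate between refining the similarity graph and smoothing the data.

    Assumption: neighbour averaging stands in for the graph autoencoder.
    """
    smoothed = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        sim = cosine_similarity(smoothed)  # refine pairwise similarity
        graph = knn_graph(sim, k)
        smoothed = graph @ smoothed  # denoise by averaging over neighbours
    return smoothed, cosine_similarity(smoothed)


if __name__ == "__main__":
    # Toy data (hypothetical): two groups of 15 cells, 20 genes each.
    rng = np.random.default_rng(0)
    rates = np.full((30, 20), 0.2)
    rates[:15, :10] = 5.0  # group A expresses the first 10 genes
    rates[15:, 10:] = 5.0  # group B expresses the last 10 genes
    counts = rng.poisson(rates)
    smoothed, sim = iterative_smoothing(counts, k=5, n_iter=3)
    print(smoothed.shape, sim.shape)
```

Because each smoothing step replaces a cell's profile with an average over its current neighbours, the similarity computed in the next round is based on less noisy profiles, which is the intuition behind refining the two in turn. The published method learns this refinement with graph autoencoders rather than fixed averaging.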
