CSI-GEP: A GPU-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell RNA-seq data.
Author information
Liu Xueying, Chapple Richard H, Bennett Declan, Wright William C, Sanjali Ankita, Culp Erielle, Zhang Yinwen, Pan Min, Geeleher Paul
Affiliations
Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA.
Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA; Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA.
Publication information
Cell Genom. 2025 Jan 8;5(1):100739. doi: 10.1016/j.xgen.2024.100739.
Exploratory analysis of single-cell RNA sequencing (scRNA-seq) typically relies on hard clustering over two-dimensional projections like uniform manifold approximation and projection (UMAP). However, such methods can severely distort the data and have many arbitrary parameter choices. Methods that can model scRNA-seq data as non-discrete "gene expression programs" (GEPs) can better preserve the data's structure, but currently, they are often not scalable, not consistent across repeated runs, and lack an established method for choosing key parameters. Here, we developed a GPU-based unsupervised learning approach, "consensus and scalable inference of gene expression programs" (CSI-GEP). We show that CSI-GEP can recover ground truth GEPs in real and simulated atlas-scale scRNA-seq datasets, significantly outperforming cutting-edge methods, including GPT-based neural networks. We applied CSI-GEP to a whole mouse brain atlas of 2.2 million cells, disentangling endothelial cell types missed by other methods, and to an integrated scRNA-seq atlas of human tumors and cell lines, discovering mesenchymal-like GEPs unique to cancer cells growing in culture.