School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China.
Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbab570.
Single-cell RNA sequencing (scRNA-seq) techniques provide high-resolution data on cellular heterogeneity in diverse tissues, and a critical step in analyzing these data is cell type identification. Traditional methods usually cluster the cells and manually annotate the clusters using marker genes, which is time-consuming and subjective. With the launch of several large-scale single-cell projects, millions of sequenced cells have been annotated, and it is promising to transfer labels from these annotated datasets to newly generated ones. One powerful way to transfer labels is to learn cell relations through a graph neural network (GNN), but traditional GNNs struggle to process millions of cells because the message-passing procedure is expensive at every training epoch. Here, we have developed a robust and scalable GNN-based method for accurate single-cell classification (GraphCS), in which a graph is constructed to connect similar cells within and between labeled and unlabeled scRNA-seq datasets so that shared information can be propagated. To overcome the slow information propagation of GNNs at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, which yields complexity sublinear in the number of cells. Compared with existing methods, GraphCS demonstrates better performance on simulated, cross-platform, cross-species and cross-omics scRNA-seq datasets. More importantly, our model offers high speed and scalability on large datasets, achieving superior performance on 1 million cells within 50 min.
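The idea of pre-calculating the diffused information once, rather than passing messages at every epoch, can be illustrated with a small sketch. This is not the GraphCS implementation: it assumes a dense k-nearest-neighbour graph over the combined reference and query cells, computes the Generalized PageRank sum exactly rather than approximately, uses geometric (personalized-PageRank-style) hop weights, and swaps the neural classifier for a nearest-centroid rule. The function names `knn_adjacency` and `gpr_features`, the parameters `alpha` and `num_hops`, and the toy data are all hypothetical.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric k-nearest-neighbour adjacency (dense; toy sizes only)."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)                         # symmetrise

def gpr_features(A, X, alpha=0.1, num_hops=10):
    """Exact Z = sum_l w_l P^l X with P = D^-1/2 (A + I) D^-1/2.

    The weights w_l = alpha * (1 - alpha)^l are an assumed choice.
    """
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    P = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    H = X.copy()
    Z = alpha * H
    for hop in range(1, num_hops + 1):
        H = P @ H                                     # one propagation step
        Z += alpha * (1.0 - alpha) ** hop * H
    return Z

# Toy data: two well-separated "cell types" in a 5-dimensional expression space.
rng = np.random.default_rng(0)
n_per_type, n_genes = 40, 5
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_type, n_genes)),
    rng.normal(5.0, 1.0, size=(n_per_type, n_genes)),
])
y = np.repeat([0, 1], n_per_type)

# Even-indexed cells play the labeled reference, odd-indexed cells the query.
is_ref = np.arange(X.shape[0]) % 2 == 0

# The graph spans reference and query cells; diffusion is computed once.
A = knn_adjacency(X, k=5)
Z = gpr_features(A, X)

# Any classifier can now be trained on Z without further message passing.
centroids = np.stack([Z[is_ref & (y == c)].mean(axis=0) for c in (0, 1)])
dists = ((Z[~is_ref][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y[~is_ref]).mean()
```

Because `Z` is fixed after the single diffusion pass, each training epoch touches only a feature matrix and never the graph. The dense matrices used here scale quadratically with the number of cells, so a dataset of millions of cells would need sparse storage and an approximate propagation scheme, as the abstract describes.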