Yuan Musu, Chen Liang, Deng Minghua
Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China.
Department of Probability and Statistics, School of Mathematical Sciences, Peking University, Beijing, China.
Front Genet. 2022 Aug 22;13:977968. doi: 10.3389/fgene.2022.977968. eCollection 2022.
Single-cell multiomics sequencing techniques have developed rapidly in the past few years. Among these techniques, cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) allows simultaneous quantification of gene expression and surface proteins in single cells. Clustering CITE-seq data has great potential to provide a more comprehensive and in-depth view of cell states and interactions. However, CITE-seq data inherit the properties of scRNA-seq data: they are noisy, high-dimensional, and highly sparse. Moreover, the representations of RNA and surface protein are sometimes weakly correlated and contribute unequally to the clustering objective. To overcome these obstacles and find a combined representation well suited for clustering, we propose scCTClust for clustering analysis of multiomics data, especially CITE-seq data. Two omics-specific neural networks are introduced to extract cluster information from the omics data. A deep canonical correlation method is adopted to find maximally correlated representations of the two omics. A novel decentralized clustering method is applied to the linear combination of the latent representations of the two omics. The fusion weights, which account for the contribution of each omics layer to clustering, are adaptively updated during training. Extensive experiments on both simulated and real CITE-seq data sets demonstrate the power of scCTClust. We also applied scCTClust to transcriptome-epigenome data to illustrate its potential for generalization.
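The weighted fusion step described above can be illustrated with a minimal sketch. This is not the scCTClust implementation: the function names (`softmax`, `fuse`, `pearson`), the softmax parameterization of the fusion weights, and the toy latent values are all assumptions made for illustration, and a plain Pearson correlation stands in for the deep canonical correlation objective.

```python
import math

def softmax(logits):
    # Map unconstrained logits to positive weights that sum to 1
    # (an assumed parameterization, not taken from the paper).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(z_rna, z_protein, logits):
    # Linear combination of the two per-cell latent representations.
    w_rna, w_protein = softmax(logits)
    return [
        [w_rna * a + w_protein * b for a, b in zip(row_r, row_p)]
        for row_r, row_p in zip(z_rna, z_protein)
    ]

def pearson(x, y):
    # Correlation between one latent dimension of each omics layer,
    # used here as a simple stand-in for a canonical correlation.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 2-D latent representations for three cells.
z_rna = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
z_protein = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

# Equal logits give equal fusion weights; in training the logits
# would be updated together with the network parameters.
fused = fuse(z_rna, z_protein, [0.0, 0.0])
corr = pearson([row[0] for row in z_rna], [row[0] for row in z_protein])
```

In this sketch the fused matrix would be passed to the clustering step, and raising one logit relative to the other shifts the fused representation toward that omics layer.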