School of Computer Science, McGill University, Montreal, QC, Canada.
Harvard-MIT Health Sciences and Technology, Cambridge, MA, USA.
Nat Commun. 2021 Sep 6;12(1):5261. doi: 10.1038/s41467-021-25534-2.
The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, large-scale integrative analysis of scRNA-seq data remains a challenge, largely due to unwanted batch effects and the limited transferability, interpretability, and scalability of existing computational methods. We present the single-cell Embedded Topic Model (scETM). Our key contribution is to combine a transferable neural-network-based encoder with an interpretable linear decoder obtained via a matrix tri-factorization. In particular, scETM simultaneously learns an encoder network to infer cell type mixture and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from multiple scRNA-seq datasets. scETM is scalable to over 10^6 cells and confers remarkable cross-tissue and cross-species zero-shot transfer-learning performance. Using gene set enrichment analysis, we find that scETM-learned topics are enriched in biologically meaningful and disease-related pathways. Lastly, scETM enables the incorporation of known gene sets into the gene embeddings, thereby directly learning the associations between pathways and topics via the topic embeddings.
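The linear decoder described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the tri-factorization multiplies a cell-by-topic mixture, a topic-embedding matrix, and a gene-embedding matrix, that a per-batch intercept is added per gene, and that a softmax over genes yields expression rates. All names (`theta`, `alpha`, `rho`, `batch_intercepts`, `decode`) and all dimensions are hypothetical, and the encoder network is replaced by random values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def decode(theta, alpha, rho, batch_intercepts, batch_ids):
    """Linear decoder sketch (assumed form, not taken from the paper's code).

    theta:            (cells, topics)   cell-by-topic mixture
    alpha:            (topics, emb_dim) topic embeddings
    rho:              (emb_dim, genes)  gene embeddings
    batch_intercepts: (batches, genes)  per-batch linear intercepts
    batch_ids:        (cells,)          batch index of each cell
    """
    # Tri-factorization: mixture x topic embeddings x gene embeddings
    logits = theta @ alpha @ rho
    # Add the intercept row that belongs to each cell's batch
    logits = logits + batch_intercepts[batch_ids]
    # Normalize over genes so each cell gets a distribution of rates
    return softmax(logits, axis=1)


# Toy dimensions chosen only for illustration
rng = np.random.default_rng(0)
n_cells, n_topics, emb_dim, n_genes, n_batches = 5, 3, 4, 6, 2

# Stand-in for the encoder output: random logits pushed through a softmax
theta = softmax(rng.normal(size=(n_cells, n_topics)), axis=1)
alpha = rng.normal(size=(n_topics, emb_dim))
rho = rng.normal(size=(emb_dim, n_genes))
batch_intercepts = rng.normal(size=(n_batches, n_genes))
batch_ids = np.array([0, 0, 1, 1, 0])

rates = decode(theta, alpha, rho, batch_intercepts, batch_ids)
```

Because every step after the encoder is linear up to the final softmax, each topic's row of `alpha @ rho` can be read directly as per-gene scores for that topic, which is one way to see why such a decoder is considered interpretable.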