Rezaie Narges, Rebboah Elisabeth, Williams Brian A, Liang Heidi Yahan, Reese Fairlie, Balderrama-Gutierrez Gabriela, Dionne Louise A, Reinholdt Laura, Trout Diane, Wold Barbara J, Mortazavi Ali
Department of Developmental and Cell Biology, University of California, Irvine, CA, USA.
Center for Complex Biological Systems, University of California, Irvine, CA, USA.
bioRxiv. 2024 Feb 29:2024.02.26.582178. doi: 10.1101/2024.02.26.582178.
The gene expression profiles of distinct cell types reflect complex genomic interactions among multiple simultaneous biological processes within each cell, which can be altered by disease progression as well as by genetic background. Identifying these active cellular programs is an open challenge in the analysis of single-cell RNA-seq data. Latent Dirichlet Allocation (LDA) is a generative method used to identify recurring patterns in count data, commonly referred to as topics, which can be used to interpret the state of each cell. However, LDA's interpretability is hindered by several key factors, including the need to choose the number of topics as a hyperparameter and the variability in topic definitions caused by random initialization. We developed Topyfic, a Reproducible LDA (rLDA) package, to accurately infer the identity and activity of cellular programs in single-cell data, providing insights into the relative contribution of each program in individual cells. We apply Topyfic to brain single-cell and single-nucleus datasets of two 5xFAD mouse models of Alzheimer's disease, crossed with either C57BL/6J or CAST/EiJ mice, to identify distinct cell types and the cell states within them, for example in microglia. We find that 8-month-old 5xFAD/Cast F1 males show a higher level of microglial activation than matching 5xFAD/BL6 F1 males, whereas female mice of the two genotypes show similar levels of microglial activation. We show that regulatory genes alone, such as transcription factors (TFs), microRNA host genes, and chromatin regulatory genes, capture cell types and cell states. Our study highlights how topic modeling with a limited vocabulary of regulatory genes can identify gene expression programs in single-cell data and quantify similar and divergent cell states in distinct genotypes.
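To make the reproducibility problem concrete, the sketch below shows one way topics from several randomly initialized LDA runs could be reconciled. It is not Topyfic's API, and the abstract does not state how Topyfic does this. The greedy cosine-similarity grouping, the function name `consensus_topics`, the 0.9 threshold, and the toy topic-by-gene weights are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(members):
    """Column-wise mean of a list of topic vectors."""
    return [sum(col) / len(members) for col in zip(*members)]


def consensus_topics(runs, threshold=0.9):
    """Greedily group topics across runs, then average each group.

    Hypothetical helper: a topic joins the first group whose centroid it
    matches with cosine similarity >= threshold, else it starts a new group.
    Returns a list of (mean topic vector, number of supporting topics).
    """
    clusters = []
    for topics in runs:
        for topic in topics:
            for members in clusters:
                if cosine(topic, centroid(members)) >= threshold:
                    members.append(topic)
                    break
            else:
                clusters.append([topic])
    return [(centroid(members), len(members)) for members in clusters]


# Toy topic-by-gene weights from three hypothetical LDA runs over four genes.
# Topic order differs between runs, as it would with different random seeds.
runs = [
    [[0.70, 0.20, 0.05, 0.05], [0.05, 0.05, 0.40, 0.50]],
    [[0.05, 0.10, 0.45, 0.40], [0.65, 0.25, 0.05, 0.05]],
    [[0.72, 0.18, 0.05, 0.05], [0.25, 0.25, 0.25, 0.25]],
]

topics = consensus_topics(runs)
# Keep only topics that recur in at least two runs.
reproducible = [(vec, n) for vec, n in topics if n >= 2]
```

In this toy input, the two topics that recur across runs are kept, while the flat topic that appears in only one run is dropped. Filtering by how many runs support a topic is one simple way to separate stable programs from artifacts of initialization.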