Dey Kushal K, Hsiao Chiaowen Joyce, Stephens Matthew
Department of Statistics, University of Chicago, Chicago, Illinois, United States of America.
Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America.
PLoS Genet. 2017 Mar 23;13(3):e1006599. doi: 10.1371/journal.pgen.1006599. eCollection 2017 Mar.
Grade of membership models, also known as "admixture models", "topic models" or "Latent Dirichlet Allocation", are a generalization of cluster models that allow each sample to have membership in multiple clusters. These models are widely used in population genetics to model admixed individuals who have ancestry from multiple "populations", and in natural language processing to model documents having words from multiple "topics". Here we illustrate the potential for these models to cluster samples of RNA-seq gene expression data, measured on either bulk samples or single cells. We also provide methods to help interpret the clusters, by identifying genes that are distinctively expressed in each cluster. By applying these methods to several example RNA-seq applications we demonstrate their utility in identifying and summarizing structure and heterogeneity. Applied to data from the GTEx project on 53 human tissues, the approach highlights similarities among biologically related tissues and identifies distinctively expressed genes that recapitulate known biology. Applied to single-cell expression data from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and identifies genes involved in a variety of relevant processes, from germ cell development, through compaction and morula formation, to the formation of inner cell mass and trophoblast at the blastocyst stage. The methods are implemented in the Bioconductor package CountClust.
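To make the idea of partial membership concrete, the following is a minimal sketch in Python, not the CountClust implementation. It assumes each sample's read counts are multinomial with gene proportions given by a mixture of cluster profiles, and fits the model with plain EM updates. The names `fit_gom`, `log_lik`, `distinctive_genes`, `omega` (sample-by-cluster memberships) and `theta` (cluster-by-gene expression profiles) are hypothetical, as is the toy count matrix. The gene-ranking score shown is one plausible divergence-based choice for picking out distinctively expressed genes and should not be read as the exact score used by the package.

```python
import numpy as np

def fit_gom(counts, k, n_iter=200, seed=0, eps=1e-12):
    """Fit a grade of membership model by EM (illustrative sketch).

    counts: samples x genes matrix of non-negative read counts.
    Returns omega (samples x k, rows sum to 1) and
    theta (k x genes, rows sum to 1).
    """
    rng = np.random.default_rng(seed)
    n, g = counts.shape
    omega = rng.dirichlet(np.ones(k), size=n)  # random starting memberships
    theta = rng.dirichlet(np.ones(g), size=k)  # random starting profiles
    for _ in range(n_iter):
        # Ratio of observed counts to the currently fitted gene proportions.
        ratio = counts / (omega @ theta + eps)
        # EM updates: expected counts attributed to each cluster,
        # both computed from the old omega and theta, then renormalized.
        new_omega = omega * (ratio @ theta.T)
        new_theta = theta * (omega.T @ ratio)
        omega = new_omega / (new_omega.sum(axis=1, keepdims=True) + eps)
        theta = new_theta / (new_theta.sum(axis=1, keepdims=True) + eps)
    return omega, theta

def log_lik(counts, omega, theta, eps=1e-12):
    """Multinomial log-likelihood, up to a constant that ignores the fit."""
    return float(np.sum(counts * np.log(omega @ theta + eps)))

def distinctive_genes(theta, top=3, eps=1e-12):
    """Rank genes per cluster by the smallest divergence from any other
    cluster's profile (an assumed score; requires at least two clusters)."""
    t = theta + eps
    k = t.shape[0]
    ranked = []
    for a in range(k):
        others = [b for b in range(k) if b != a]
        # Poisson-style divergence of cluster a from each other cluster, per gene.
        kl = t[a] * np.log(t[a] / t[others]) + t[others] - t[a]
        score = kl.min(axis=0)
        ranked.append(np.argsort(score)[::-1][:top])
    return np.array(ranked)

# Toy data (made up): two samples dominated by genes 0-2, two by genes 3-5,
# and one sample that is an even mixture of the two patterns.
counts = np.array([
    [30, 30, 40, 0, 0, 0],
    [25, 35, 40, 0, 0, 0],
    [0, 0, 0, 40, 30, 30],
    [0, 0, 0, 35, 40, 25],
    [15, 15, 20, 20, 15, 15],
], dtype=float)

omega, theta = fit_gom(counts, k=2)
top = distinctive_genes(theta, top=3)
print(np.round(omega, 2))  # each row is one sample's membership proportions
print(top)                 # gene indices ranked as most distinctive per cluster
```

Each row of `omega` can be read like an admixture proportion vector: a sample is allowed to sit between clusters instead of being forced into one. EM only finds a local optimum and the cluster labels are arbitrary, so in practice several random starts would be compared.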