Vallejos Catalina A, Marioni John C, Richardson Sylvia
MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, United Kingdom; EMBL European Bioinformatics Institute, Cambridge, United Kingdom.
EMBL European Bioinformatics Institute, Cambridge, United Kingdom.
PLoS Comput Biol. 2015 Jun 24;11(6):e1004333. doi: 10.1371/journal.pcbi.1004333. eCollection 2015 Jun.
Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuinely heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model in which: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell's lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This criterion is formalised by means of tail posterior probabilities associated with high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse embryonic stem cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable support the efficacy of our approach.
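The detection criterion described in the abstract can be illustrated with a minimal sketch. It assumes that posterior draws of each gene's biological variance contribution (the fraction of total variance attributed to biological rather than technical sources) are already available from some sampler. The function names, the data layout and the two threshold values are hypothetical choices made for illustration; they are not taken from the paper or from the BASiCS software.

```python
from typing import Dict, List, Sequence


def tail_posterior_probabilities(
    draws: Dict[str, Sequence[float]], variance_threshold: float
) -> Dict[str, float]:
    """Monte Carlo estimate of P(biological variance fraction > threshold | data).

    `draws` maps a gene name to posterior samples of that gene's biological
    variance fraction (hypothetical layout, assumed for this sketch).
    """
    probabilities = {}
    for gene, samples in draws.items():
        if len(samples) == 0:
            raise ValueError(f"no posterior draws for gene {gene!r}")
        # Fraction of draws strictly above the variance threshold.
        exceed = sum(1 for s in samples if s > variance_threshold)
        probabilities[gene] = exceed / len(samples)
    return probabilities


def highly_variable_genes(
    draws: Dict[str, Sequence[float]],
    variance_threshold: float,
    evidence_threshold: float,
) -> List[str]:
    """Genes whose tail posterior probability exceeds the evidence threshold."""
    probabilities = tail_posterior_probabilities(draws, variance_threshold)
    return sorted(g for g, p in probabilities.items() if p > evidence_threshold)


# Toy posterior draws (made up for illustration, not real data).
posterior_draws = {
    "gene_a": [0.85, 0.90, 0.80, 0.95],
    "gene_b": [0.10, 0.20, 0.65, 0.15],
    "gene_c": [0.75, 0.60, 0.82, 0.55],
}

# Both thresholds are arbitrary illustrative values.
print(tail_posterior_probabilities(posterior_draws, 0.7))
print(highly_variable_genes(posterior_draws, 0.7, 0.8))
```

A lowly variable gene could be flagged the same way by counting draws that fall below a low variance threshold instead. Reporting a probability per gene, rather than only a binary call, lets users see how strong the evidence is for each classification.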