Prabhakaran Sandhya, Azizi Elham, Carr Ambrose, Pe'er Dana
Departments of Biological Sciences, Systems Biology and Computer Science, Columbia University, New York, NY, USA.
JMLR Workshop Conf Proc. 2016;48:1070-1079.
We introduce an iterative normalization and clustering method for single-cell gene expression data. The emerging technology of single-cell RNA-seq gives access to gene expression measurements for thousands of cells, allowing discovery and characterization of cell types. However, the data is confounded by technical variation emanating from experimental errors and cell type-specific biases. Current approaches perform a global normalization prior to analyzing biological signals, which does not resolve missing data or variation dependent on latent cell types. Our model is formulated as a hierarchical Bayesian mixture model with cell-specific scalings that aid the iterative normalization and clustering of cells, teasing apart technical variation from biological signals. We demonstrate that this approach is superior to global normalization followed by clustering. We show identifiability and weak convergence guarantees of our method and present a scalable Gibbs inference algorithm. This method improves cluster inference in both synthetic and real single-cell data compared with previous methods, and allows easy interpretation and recovery of the underlying structure and cell types.
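The idea of alternating between normalization and clustering can be illustrated with a small toy sketch. It is not the hierarchical Bayesian mixture model or the Gibbs inference algorithm described above. It swaps in a deterministic, hard-assignment alternation: assign each cell to the cluster that explains it best under its own best-fitting scale, then re-estimate the cluster means from the rescaled cells. The multiplicative form of the cell-specific scaling, the simulated data, and the names `simulate_cells` and `fit_scaled_clusters` are all assumptions made for illustration.

```python
import numpy as np


def simulate_cells(n_cells=200, seed=0):
    """Toy data (assumed, not from the paper): two cell types, and each
    cell's expression vector is multiplied by its own technical scale."""
    rng = np.random.default_rng(seed)
    type_means = np.array([[5.0, 1.0, 1.0, 5.0, 1.0],
                           [1.0, 5.0, 5.0, 1.0, 1.0]])
    labels = rng.integers(0, 2, size=n_cells)
    scales = rng.uniform(0.5, 2.0, size=n_cells)
    noise = rng.normal(0.0, 0.1, size=(n_cells, type_means.shape[1]))
    y = scales[:, None] * (type_means[labels] + noise)
    return y, labels, scales


def fit_scaled_clusters(y, n_iter=10):
    """Hard-assignment alternation for two clusters (a simplification,
    not a Gibbs sampler): estimate per-cell scales and cluster labels
    jointly, then update the cluster means from the rescaled cells."""
    # Initialise the two means with cell 0 and the cell least similar
    # to it in direction (cosine similarity), so scale is ignored.
    unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    far = np.argmin(unit @ unit[0])
    mu = np.stack([y[0], y[far]])

    for _ in range(n_iter):
        # Least-squares scale of every cell under every cluster mean.
        alpha_all = (y @ mu.T) / np.sum(mu ** 2, axis=1)
        # Residual left after removing the scaled cluster mean.
        resid = np.linalg.norm(
            y[:, None, :] - alpha_all[:, :, None] * mu[None, :, :], axis=2
        )
        z = np.argmin(resid, axis=1)              # clustering step
        alpha = alpha_all[np.arange(len(y)), z]   # normalization step
        for k in range(mu.shape[0]):
            members = z == k
            if members.any():
                mu[k] = np.mean(y[members] / alpha[members, None], axis=0)
    return z, alpha, mu
```

Calling `fit_scaled_clusters(simulate_cells()[0])` returns the hard cluster labels, the per-cell scale estimates, and the cluster means. The scale estimates are only defined up to a constant factor within each cluster, since a cluster mean and its cells' scales can be rescaled against each other. A full Bayesian treatment would instead place priors on these quantities and sample them.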