Interdisciplinary Ph.D. Program in Biostatistics, Ohio State University, Columbus, Ohio, United States of America.
Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, Texas, United States of America.
PLoS Comput Biol. 2022 Jun 13;18(6):e1010129. doi: 10.1371/journal.pcbi.1010129. eCollection 2022 Jun.
Single-cell Hi-C techniques enable the study of cell-to-cell variability in chromatin interactions. However, single-cell Hi-C (scHi-C) data suffer severely from sparsity, that is, an excess of zeros due to insufficient sequencing depth. Complicating the matter further, not all zeros are created equal: some arise because the loci truly do not interact, owing to the underlying biological mechanism (structural zeros); others are indeed due to insufficient sequencing depth (sampling zeros or dropouts), especially for loci that interact infrequently. Differentiating between structural zeros and dropouts is important because correct inference would improve downstream analyses such as clustering and the discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single-cell Hi-C literature, where sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, we propose HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes the spatial dependencies of the scHi-C 2D data structure into account while also borrowing information from similar single cells and from bulk data, when available. Through an extensive set of analyses of synthetic and real data, we demonstrate HiCImpute's ability to identify structural zeros with high sensitivity and to impute dropout values accurately. Downstream analyses using data improved by HiCImpute yielded much more accurate clustering of cell types than analyses using the observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data have led to the identification of subtypes within both the L4 and the L5 excitatory neuronal cells of the prefrontal cortex.
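To make the distinction between structural zeros and dropouts concrete, the following is a minimal sketch using a generic zero-inflated Poisson mixture. It is an illustration only and not the HiCImpute model itself: the function name `structural_zero_posterior`, the prior probability `pi`, and the expected count `lam` are all assumptions introduced here. In this toy setting, `lam` stands in for the interaction intensity that one might estimate from neighboring bins, similar cells, or bulk data.

```python
import math


def structural_zero_posterior(pi: float, lam: float) -> float:
    """Posterior probability that an observed zero is a structural zero.

    Toy zero-inflated Poisson model (an assumption, not the paper's model):
      - with probability pi the locus pair never interacts (structural zero);
      - otherwise the count is Poisson(lam), which yields a zero (a dropout)
        with probability exp(-lam).
    Bayes' rule then gives
      P(structural | count = 0) = pi / (pi + (1 - pi) * exp(-lam)).
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    if pi == 0.0:
        return 0.0  # a structural zero is impossible under this prior
    dropout_mass = (1.0 - pi) * math.exp(-lam)
    return pi / (pi + dropout_mass)


if __name__ == "__main__":
    # Hypothetical values: same prior, increasing expected interaction count.
    for lam in (0.1, 1.0, 5.0):
        p = structural_zero_posterior(pi=0.2, lam=lam)
        print(f"lam={lam:>4}: P(structural | zero) = {p:.3f}")
```

Under these assumptions, a zero at a pair of loci expected to interact rarely (small `lam`) is nearly uninformative, so the posterior stays close to the prior. A zero where many contacts are expected (large `lam`) points strongly to a structural zero. This matches the abstract's remark that dropouts are most likely for loci that interact infrequently.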