Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg, Germany.
Signalling Research Centre CIBSS, University of Freiburg, 79104 Freiburg, Germany.
Bioinformatics. 2021 Nov 18;37(22):4006-4013. doi: 10.1093/bioinformatics/btab394.
Hi-C technology provides insights into the 3D organization of chromatin, and the single-cell Hi-C method lets researchers study the chromatin state at the level of individual cells. Single-cell Hi-C interaction matrices are high-dimensional and very sparse. To cluster thousands of them, each matrix is flattened and all are compiled into a single matrix. Depending on the resolution, this matrix can have a few million or even billions of features, so computations can be memory-intensive. We present a single-cell Hi-C clustering approach that uses an approximate nearest neighbors method based on locality-sensitive hashing to reduce the dimensionality and the computational resources required.
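The idea of finding approximate nearest neighbors with locality-sensitive hashing can be illustrated with a minimal sketch. It is not the scHiCExplorer or sparse-neighbors-search implementation: it assumes MinHash with banding as the hashing scheme, treats each flattened matrix as the set of its non-zero feature indices, and runs on a small synthetic dataset. The function names, parameters and the toy data are all hypothetical.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix

PRIME = (1 << 31) - 1  # modulus small enough that a * x + b fits in int64


def minhash_signatures(X, n_hashes=64, seed=0):
    """Return an (n_cells, n_hashes) MinHash signature of each row's non-zero columns."""
    X = csr_matrix(X, copy=True)
    X.eliminate_zeros()
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=n_hashes, dtype=np.int64)
    b = rng.integers(0, PRIME, size=n_hashes, dtype=np.int64)
    sig = np.full((X.shape[0], n_hashes), PRIME, dtype=np.int64)
    for i in range(X.shape[0]):
        cols = X.indices[X.indptr[i]:X.indptr[i + 1]].astype(np.int64) % PRIME
        if cols.size:
            # One universal hash per signature slot; keep the minimum per hash.
            hashed = (a[:, None] * cols[None, :] + b[:, None]) % PRIME
            sig[i] = hashed.min(axis=1)
    return sig


def approximate_neighbors(sig, n_neighbors=3, rows_per_band=4):
    """Collect candidates that share at least one band, then rank them by signature agreement."""
    n_cells, n_hashes = sig.shape
    candidates = [set() for _ in range(n_cells)]
    for start in range(0, n_hashes, rows_per_band):
        buckets = defaultdict(list)
        for i in range(n_cells):
            buckets[sig[i, start:start + rows_per_band].tobytes()].append(i)
        for members in buckets.values():
            for i in members:
                candidates[i].update(members)
    neighbors = []
    for i in range(n_cells):
        # The fraction of equal signature slots estimates the Jaccard similarity.
        scored = [(np.mean(sig[i] == sig[j]), j) for j in sorted(candidates[i] - {i})]
        scored.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append([j for _, j in scored[:n_neighbors]])
    return neighbors


# Synthetic data (an assumption for illustration): two groups of 10 "cells",
# each cell keeping about 80% of its group's 500 template features.
rng = np.random.default_rng(42)
n_features = 100_000
perm = rng.permutation(n_features)
templates = [perm[:500], perm[500:1000]]
labels = np.repeat([0, 1], 10)
rows, cols = [], []
for cell, label in enumerate(labels):
    keep = templates[label][rng.random(500) < 0.8]
    rows.extend([cell] * keep.size)
    cols.extend(keep.tolist())
X = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(labels.size, n_features))

signatures = minhash_signatures(X)
neighbors = approximate_neighbors(signatures)
print(neighbors[0], neighbors[-1])
```

In this sketch each cell is reduced from 100,000 sparse features to 64 integers, and only cells that collide in at least one band are compared at all. A clustering step could then run on the resulting neighbor lists rather than on the full feature matrix.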
The presented method can process a single-cell Hi-C dataset of 2600 cells at 10 kb resolution using 40 GB of memory, while competing approaches are not computable even with 1 TB of memory. The results show that it separates cells by their chromatin folding properties, and therefore clusters single-cell Hi-C data, better than competing algorithms.
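A rough back-of-the-envelope calculation suggests why a dense representation at this resolution is out of reach. The sketch below assumes a genome of about 3 Gb, counts every genome-wide bin pair in the upper triangle as a feature, and stores each value as a 4-byte float. These are illustrative assumptions, and the feature construction used in the paper may differ.

```python
def dense_memory_estimate(genome_size_bp, resolution_bp, n_cells, bytes_per_value=4):
    """Return (features per cell, bytes for a dense cells-by-features matrix)."""
    n_bins = -(-genome_size_bp // resolution_bp)  # ceiling division
    n_features = n_bins * (n_bins + 1) // 2       # upper triangle incl. diagonal
    return n_features, n_features * n_cells * bytes_per_value


# Assumed values: 3 Gb genome, 10 kb bins, 2600 cells, float32 storage.
n_features, n_bytes = dense_memory_estimate(3_000_000_000, 10_000, 2600)
print(f"{n_features:,} features per cell, {n_bytes / 1e12:.0f} TB dense")
```

Under these assumptions a dense matrix would need hundreds of terabytes, which is why sparse storage and a reduction of the feature space matter at 10 kb resolution.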
The presented clustering algorithm is part of scHiCExplorer, which is available on GitHub at https://github.com/joachimwolff/scHiCExplorer and as a conda package via the bioconda channel. The approximate nearest neighbors implementation is available at https://github.com/joachimwolff/sparse-neighbors-search and as a conda package via the bioconda channel.
Supplementary data are available at Bioinformatics online.