Delft Bioinformatics Lab, Delft University of Technology, 2628 XE Delft, The Netherlands.
Leiden Computational Biology Center.
Bioinformatics. 2020 Dec 30;36(Suppl_2):i849-i856. doi: 10.1093/bioinformatics/btaa816.
Single-cell data capture multiple cellular markers per cell for thousands to millions of cells. Identifying distinct cell populations is a key step toward further biological understanding and is usually performed by clustering these data. Dimensionality-reduction-based clustering tools either do not scale to large datasets containing millions of cells or are not fully automated, requiring an initial manual estimate of the number of clusters. Graph clustering tools provide automated and reliable clustering for single-cell data but scale poorly to large datasets.
We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data into a hierarchy of datasets, each containing a subset of data points that follows the original data manifold. SCHNEL's novel approach combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. On seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data and produced meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data as well as to the popular machine learning benchmark dataset MNIST.
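The general idea of clustering a reduced representation and then carrying the labels back to every cell can be illustrated with a minimal Python sketch. This is not the SCHNELpy API and not the authors' algorithm: uniform random subsampling stands in for SCHNEL's manifold-preserving hierarchy, connected components of a k-nearest-neighbour graph stand in for the graph clustering step, and all function names and parameters (`knn_graph`, `connected_components`, `cluster_via_landmarks`, `n_landmarks`, `k`) are hypothetical.

```python
import math
import random


def knn_graph(points, k):
    """Build a symmetrised k-nearest-neighbour graph as a list of adjacency sets."""
    adj = [set() for _ in points]
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        for j in nearest:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Label every node with the index of its connected component.

    Assumption: this is a simple stand-in for a real graph clustering
    algorithm; it only separates groups that share no edge at all.
    """
    labels = [-1] * len(adj)
    current = 0
    for start in range(len(adj)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adj[node]:
                if labels[neighbour] == -1:
                    labels[neighbour] = current
                    stack.append(neighbour)
        current += 1
    return labels


def cluster_via_landmarks(points, n_landmarks, k, seed=0):
    """Cluster a small landmark subset, then propagate labels to all points."""
    rng = random.Random(seed)
    # Assumption: uniform random subsampling replaces the hierarchy construction.
    landmark_idx = rng.sample(range(len(points)), n_landmarks)
    landmarks = [points[i] for i in landmark_idx]

    # Graph clustering runs only on the landmarks, which keeps it cheap.
    landmark_labels = connected_components(knn_graph(landmarks, k))

    # Each point inherits the label of its nearest landmark.
    labels = []
    for p in points:
        nearest = min(range(n_landmarks), key=lambda j: math.dist(p, landmarks[j]))
        labels.append(landmark_labels[nearest])
    return labels


# Toy data: two well-separated 2-D blobs of 200 points each.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(rng.gauss(20, 0.5), rng.gauss(20, 0.5)) for _ in range(200)]
points = blob_a + blob_b
labels = cluster_via_landmarks(points, n_landmarks=60, k=10)
```

The expensive graph step touches only the 60 landmarks, while the remaining points need a single nearest-landmark lookup each. That division of work is what the sketch is meant to convey; the subsampling and clustering choices here are placeholders.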
Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available.
Supplementary data are available at Bioinformatics online.