Li Sijie, Li Yuxi, Sun Yu, Li Yaru, Chen Xiaoyang, Tang Songming, Chen Shengquan
School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China.
Institute of Health Service and Transfusion Medicine, Beijing 100850, China.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae191.
Recent technical advances in single-cell chromatin accessibility sequencing (scCAS) have brought new insights into the characterization of epigenetic heterogeneity. As single-cell genomics experiments scale up to hundreds of thousands of cells, the computational resources required for downstream analysis grow beyond what most researchers have available. Here, we propose EpiCarousel, a tailored Python package based on lazy loading, parallel processing, and community detection for memory- and time-efficient identification of metacells, i.e. groups of homogeneous cells, in large-scale scCAS data. In comprehensive experiments on five datasets that differ in protocol, sample size, dimensionality, number of cell types, and degree of cell-type imbalance, EpiCarousel outperformed baseline methods in a systematic evaluation of memory usage, computational time, and multiple downstream analyses, including cell type identification. Moreover, EpiCarousel completes preprocessing and downstream cell clustering on an atlas-level dataset with 707 043 cells and 1 154 611 peaks within 2 h while consuming <75 GB of RAM, and it characterizes cell heterogeneity better than state-of-the-art methods.
The EpiCarousel software is well documented and freely available at https://github.com/biox-nku/epicarousel. It interoperates seamlessly with a wide range of scCAS analysis toolkits.
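The three ingredients named in the abstract (lazy loading, parallel processing, and community detection) can be illustrated with a small, self-contained sketch. This is not EpiCarousel's API or its actual algorithm: every function name here (`iter_chunks`, `knn_graph`, `label_propagation`, `process_chunk`, `find_metacells`) is hypothetical, the toy cells are invented, chunks are assumed to be processed independently, Jaccard similarity on binary peak sets is an assumed choice, and simple label propagation stands in for whichever community detection method the package uses.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def iter_chunks(cells, chunk_size):
    """Yield (offset, chunk) pairs; a stand-in for lazily reading cells from disk."""
    for start in range(0, len(cells), chunk_size):
        yield start, cells[start:start + chunk_size]


def jaccard(a, b):
    """Jaccard similarity between two sets of accessible peak indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_graph(chunk, k):
    """Symmetric weighted kNN graph over one chunk; zero-similarity edges are dropped."""
    n = len(chunk)
    sims = {(i, j): jaccard(chunk[i], chunk[j]) for i, j in combinations(range(n), 2)}
    graph = {i: {} for i in range(n)}
    for i in range(n):
        scored = sorted(
            ((sims[min(i, j), max(i, j)], j) for j in range(n) if j != i),
            key=lambda pair: (-pair[0], pair[1]),
        )
        for sim, j in scored[:k]:
            if sim > 0:
                graph[i][j] = sim
                graph[j][i] = sim
    return graph


def label_propagation(graph, max_iter=20):
    """Toy community detection: each node adopts the heaviest label among its neighbors."""
    labels = {node: node for node in graph}
    for _ in range(max_iter):
        changed = False
        for node in sorted(graph):
            scores = defaultdict(float)
            for neighbor, weight in graph[node].items():
                scores[labels[neighbor]] += weight
            if not scores:
                continue  # isolated cell keeps its own label
            # Highest total weight wins; ties go to the smallest label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def process_chunk(item, k=2):
    """Detect communities in one chunk and sum member profiles into metacells."""
    offset, chunk = item
    labels = label_propagation(knn_graph(chunk, k))
    groups = defaultdict(list)
    for i in range(len(chunk)):
        groups[labels[i]].append(i)
    metacells = []
    for members in groups.values():
        profile = Counter()
        for i in members:
            profile.update(chunk[i])  # count how many member cells open each peak
        metacells.append({"cells": [offset + i for i in members], "profile": dict(profile)})
    return metacells


def find_metacells(cells, chunk_size, k=2, workers=2):
    """Process chunks concurrently and concatenate the per-chunk metacells."""
    # Threads keep the sketch simple; CPU-bound work would typically use processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda item: process_chunk(item, k),
                                iter_chunks(cells, chunk_size)))
    return [metacell for chunk_result in results for metacell in chunk_result]


# Invented toy data: each cell is the set of peak indices that are accessible in it.
cells = [
    {0, 1, 2}, {0, 1, 2, 3}, {7, 8, 9}, {6, 7, 8, 9},
    {1, 2, 3}, {0, 1, 3}, {6, 8, 9}, {6, 7, 9},
]
metacells = find_metacells(cells, chunk_size=4)
for metacell in metacells:
    print(metacell["cells"], metacell["profile"])
```

In this toy run each chunk of four cells yields two metacells, so downstream steps would operate on four aggregated profiles instead of eight cells. The same reduction, applied at atlas scale, is what makes downstream clustering tractable in memory and time.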