Google Research, Brain team, 75009, Paris, France.
Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France.
Genome Biol. 2023 Jun 20;24(1):143. doi: 10.1186/s13059-023-02981-2.
Single-cell histone post-translational modification (scHPTM) assays such as scCUT&Tag or scChIP-seq allow single-cell mapping of diverse epigenomic landscapes within complex tissues and are likely to unlock our understanding of various mechanisms involved in development or disease. Running scHPTM experiments and analyzing the resulting data remain challenging, since few consensus guidelines currently exist regarding good practices for experimental design and data analysis pipelines.
We perform a computational benchmark to assess the impact of experimental parameters and data analysis pipelines on the ability of the cell representation to recapitulate known biological similarities. We run more than ten thousand experiments to systematically study the impact of coverage and number of cells, of the count matrix construction method, of feature selection and normalization, and of the dimension reduction algorithm used. This allows us to identify key experimental parameters and computational choices to obtain a good representation of single-cell HPTM data. We show in particular that the count matrix construction step has a strong influence on the quality of the representation and that using fixed-size bin counts outperforms annotation-based binning. Dimension reduction methods based on latent semantic indexing outperform others, and feature selection is detrimental, while keeping only high-quality cells has little influence on the final representation as long as enough cells are analyzed.
This benchmark provides a comprehensive study on how experimental parameters and computational choices affect the representation of single-cell HPTM data. We propose a series of recommendations regarding matrix construction, feature and cell selection, and dimensionality reduction algorithms.
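As a rough illustration of the two choices the benchmark favors (fixed-size bin counts and a dimension reduction based on latent semantic indexing), the following is a minimal sketch and not the authors' pipeline. The function names `bin_counts` and `lsi`, the toy fragment data, the 5 kb bin size, and the particular TF-IDF variant are all assumptions made for this example; LSI is taken here in its common form of TF-IDF weighting followed by a truncated SVD.

```python
import numpy as np

def bin_counts(cell_idx, positions, n_cells, chrom_length, bin_size):
    """Count fragments per (cell, fixed-size genomic bin) on one chromosome."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = np.zeros((n_cells, n_bins), dtype=np.int64)
    # np.add.at accumulates correctly when the same (cell, bin) pair repeats
    np.add.at(counts, (cell_idx, positions // bin_size), 1)
    return counts

def lsi(counts, n_components):
    """TF-IDF weighting followed by truncated SVD (one common LSI variant)."""
    # Drop bins that are empty in every cell; this is not feature selection
    counts = counts[:, counts.sum(axis=0) > 0]
    # Term frequency: each cell's counts divided by its total (guarded against 0)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    # Inverse document frequency: down-weight bins covered in many cells
    idf = np.log1p(counts.shape[0] / (counts > 0).sum(axis=0))
    u, s, _ = np.linalg.svd(tf * idf, full_matrices=False)
    # Cell embedding: leading left singular vectors scaled by singular values
    return u[:, :n_components] * s[:n_components]

# Hypothetical toy data: 300 fragments from 6 cells on a 100 kb chromosome
rng = np.random.default_rng(0)
cell_idx = rng.integers(0, 6, size=300)
positions = rng.integers(0, 100_000, size=300)

counts = bin_counts(cell_idx, positions, n_cells=6,
                    chrom_length=100_000, bin_size=5_000)
embedding = lsi(counts, n_components=3)
print(counts.shape, embedding.shape)
```

The bin size and number of components used here are placeholders chosen to keep the toy example small; they are not values recommended by the abstract.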