Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico.
The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
BMC Genomics. 2024 May 6;25(1):444. doi: 10.1186/s12864-024-10364-5.
Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for both technical and biological variability. Numerous normalization methods have been developed, each addressing different sources of dispersion and making specific assumptions about the count data.
The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this end, we first give an overview of the different single-cell sequencing platforms and commonly used methods, including isolation and library preparation protocols. Next, we discuss the inherent sources of variability in scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also outline imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. Finally, we discuss common scRNA-seq methods and toolkits used for integrated data analysis.
According to the correction performed, normalization methods can be broadly classified as within-sample and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can be further classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these categories has advantages and drawbacks and makes different statistical assumptions. However, no single normalization method performs best in all cases. Instead, metrics such as silhouette width, the K-nearest neighbor batch-effect test, or the number of highly variable genes are recommended to assess the performance of normalization methods.
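As an illustration of two ideas named above, the following minimal sketch applies one simple global scaling scheme (library-size scaling followed by a log transform) and then scores the result with the mean silhouette width. It is not taken from the review: the function names, the `target_sum` value of 10,000, the Euclidean distance, and the toy count matrix are all assumptions chosen for this example, and the silhouette function assumes at least two clusters.

```python
import numpy as np


def library_size_normalize(counts, target_sum=1e4):
    """Global scaling sketch: divide each cell (row) by its total count,
    rescale to a common target_sum (an assumed value), then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave empty cells as all zeros
    return np.log1p(counts / totals * target_sum)


def mean_silhouette_width(X, labels):
    """Mean silhouette width with Euclidean distances.
    Assumes at least two distinct cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all cells.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the cell itself
        if not same.any():
            continue  # singleton cluster: silhouette is defined as 0
        a = dist[i, same].mean()  # mean distance within own cluster
        b = min(dist[i, labels == c].mean()  # nearest other cluster
                for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


# Hypothetical toy data: 4 cells x 3 genes. Cells 0 and 1 share one
# expression profile, cells 2 and 3 share another; within each pair the
# second cell was sequenced at twice the depth.
counts = np.array([
    [10, 0, 90],
    [20, 0, 180],
    [90, 0, 10],
    [180, 0, 20],
])
labels = np.array([0, 0, 1, 1])

normalized = library_size_normalize(counts)
raw_score = mean_silhouette_width(counts, labels)
norm_score = mean_silhouette_width(normalized, labels)
print(f"silhouette (raw counts): {raw_score:.3f}")
print(f"silhouette (normalized): {norm_score:.3f}")
```

In this toy case the only difference within each cell type is sequencing depth, so scaling removes it and the silhouette width rises. Real datasets carry additional sources of variability, which is why the choice of method and of evaluation metric matters.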