How to do quantile normalization correctly for gene expression data analyses.

Affiliations

School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, China.

Department of Computer Science, National University of Singapore, Singapore, Singapore.

Publication information

Sci Rep. 2020 Sep 23;10(1):15534. doi: 10.1038/s41598-020-72664-6.

Abstract

Quantile normalization is an important normalization technique commonly used in high-dimensional data analysis. However, when applied blindly on whole datasets, it is susceptible to class-effect proportion effects (the proportion of class-correlated variables in a dataset) and batch effects (the presence of potentially confounding technical variation), resulting in higher false-positive and false-negative rates. We evaluate five strategies for performing quantile normalization, and demonstrate that good performance in terms of batch-effect correction and statistical feature selection can be readily achieved by first splitting data by sample class-labels before performing quantile normalization independently on each split ("Class-specific"). Via simulations with both real and simulated batch effects, we demonstrate that the "Class-specific" strategy (and others relying on similar principles) readily outperforms whole-data quantile normalization and is robust, preserving useful signals even during the combined analysis of separately-normalized datasets. Quantile normalization is a commonly used procedure. However, when carelessly applied on whole datasets without first considering class-effect proportion and batch effects, it can result in poor performance. If quantile normalization must be used, then we recommend using the "Class-specific" strategy.
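The "Class-specific" strategy described in the abstract can be illustrated with a minimal sketch. It assumes a features-by-samples NumPy matrix and one class label per sample. The function names `quantile_normalize` and `class_specific_qn` are hypothetical and are not taken from the paper or its software. Ties are broken by sort order rather than averaged, which is a simplification.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize a features x samples matrix (each column is a sample).

    Every sample is forced onto the same reference distribution: the mean of
    the sorted values across samples at each rank.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")  # sort order within each sample
    ranks = np.argsort(order, axis=0)             # rank of each value within its sample
    reference = np.sort(x, axis=0).mean(axis=1)   # mean value at each rank
    return reference[ranks]                       # map ranks back to reference values

def class_specific_qn(x, labels):
    """Split samples by class label, then quantile-normalize each split independently."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    out = np.empty_like(x)
    for cls in np.unique(labels):
        cols = labels == cls                      # boolean mask of samples in this class
        out[:, cols] = quantile_normalize(x[:, cols])
    return out

# Hypothetical toy data: 50 features, 3 samples in class "A" and 3 in class "B",
# with class "B" shifted upward to mimic a class effect.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
data[:, 3:] += 2.0
classes = ["A", "A", "A", "B", "B", "B"]

whole = quantile_normalize(data)                  # whole-data normalization
per_class = class_specific_qn(data, classes)      # "Class-specific" normalization
```

In this sketch, whole-data normalization gives all six samples an identical distribution, whereas the class-specific variant only equalizes distributions within each class, so a global difference between classes is not forced away.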

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fb1/7511327/8b4f79ba30ba/41598_2020_72664_Fig1_HTML.jpg
