Brian Low, Yukai Wang, Tingting Zhao, Huaxu Yu, Tao Huan
Department of Chemistry, Faculty of Science, University of British Columbia, Vancouver Campus, 2036 Main Mall, Vancouver, BC V6T 1Z1, Canada.
ACS Meas Sci Au. 2024 Oct 14;4(6):702-711. doi: 10.1021/acsmeasuresciau.4c00047. eCollection 2024 Dec 18.
Sample normalization is a crucial step in metabolomics for fair quantitative comparisons. It aims to minimize sample-to-sample variation caused by differences in the total metabolite amount. When samples lack a specific metabolic quantity that accurately represents their total metabolite amount, post-acquisition sample normalization becomes essential. Although many normalization algorithms have been proposed, understanding of their differences remains limited, hindering the selection of the most suitable one for a given metabolomics study. This study bridges this knowledge gap by employing data simulation, experimental simulation, and real experiments to elucidate differences in mechanism and performance among common post-acquisition sample normalization methods. Using public datasets, we first demonstrated the dramatic discrepancies between the outcomes of different sample normalization methods. We then benchmarked six normalization methods: sum, median, probabilistic quotient normalization (PQN), maximal density fold change (MDFC), quantile, and class-specific quantile. Our results show that most normalization methods are biased on unbalanced data, i.e., when the percentages of up- and downregulated metabolites are unequal. Notably, unbalanced data can arise from underlying biological differences, experimental perturbations, and metabolic interference. Beyond the normalization algorithm and the data structure, our study also emphasizes the importance of data-quality factors such as background noise, signal saturation, and missing values. Based on these findings, we propose an evidence-based normalization strategy to maximize sample normalization outcomes, providing a robust bioinformatic solution for advancing metabolomics research through fair quantitative comparison.
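For concreteness, below is a minimal sketch of two of the benchmarked methods, sum normalization and PQN, assuming a samples-by-features intensity matrix; the function names, the rescaling to the cohort mean total, and the use of a median reference spectrum are illustrative choices following the common PQN recipe, not the authors' implementation.

```python
import numpy as np

def sum_normalize(X):
    """Total-sum ("sum") normalization: scale each sample so that its
    feature intensities sum to the cohort mean total (scaling convention
    assumed here for readability)."""
    X = np.asarray(X, dtype=float)
    totals = np.nansum(X, axis=1, keepdims=True)
    return X / totals * np.nanmean(totals)

def pqn_normalize(X):
    """Probabilistic quotient normalization (PQN), following the common
    recipe: sum-normalize, build a median reference spectrum, then divide
    each sample by its median fold change against that reference."""
    X = sum_normalize(X)
    reference = np.nanmedian(X, axis=0)         # median reference spectrum
    valid = reference > 0                       # skip zero/missing features
    quotients = X[:, valid] / reference[valid]  # per-feature fold changes
    factors = np.nanmedian(quotients, axis=1)   # most probable dilution factor
    return X / factors[:, None]

# Example: the second sample is roughly a 2x dilution of the first;
# PQN removes the dilution factor without assuming balanced regulation.
X = np.array([[10.0, 20.0, 30.0, 40.0],
              [ 5.0, 10.0, 15.0, 22.0]])
print(pqn_normalize(X))
```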