Tseng Chi Yen, Salguero Jessica A, Breidenbach Joshua D, Solomon Emilia, Sanders Claire K, Harvey Tara, Thornhill M Grace, Palmisano Salvator J, Sasiene Zachary J, Blackwell Brett R, McBride Ethan M, Luchini Kes A, LeBrun Erick S, Alvarez Marc, Mach Phillip M, Rivera Emilio S, Glaros Trevor G
Biochemistry and Biotechnology Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA.
Microbial and Biome Sciences Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA.
Metabolomics. 2025 Jul 1;21(4):98. doi: 10.1007/s11306-025-02297-1.
Data normalization is crucial for multi-omics integration: it reduces systematic error and maximizes the likelihood of discovering true biological variation. Most studies assess normalization for a single omics type or use datasets from separate experiments. Few address time-course data, where normalization might bias temporal differentiation. In this study, we compared common normalization methods and a machine learning approach, Systematic Error Removal using Random Forest (SERRF), using multi-omics datasets generated from the same experiment, and even from the same cell lysate.
The objective was to develop a straightforward process for assessing normalization effects and identifying the most robust methods across multi-omics datasets.
We analyzed metabolomics, lipidomics, and proteomics datasets from primary human cardiomyocytes and motor neurons exposed to acetylcholine-active compounds over time. Normalization effectiveness was evaluated by the improvement in quality control (QC) feature consistency and by the change in treatment-related and time-related variance.
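One way to picture the QC consistency criterion is sketched below. It assumes that consistency is summarized as the relative standard deviation (RSD) of each feature across repeated QC injections; the abstract does not name the metric used, so the metric, the function names `qc_rsd` and `median_rsd_change`, and the toy numbers are illustrative assumptions only.

```python
import numpy as np


def qc_rsd(qc_intensities):
    """Percent RSD of each feature across QC injections.

    qc_intensities: array of shape (n_qc_samples, n_features),
    assumed to hold strictly positive, non-log intensities.
    """
    qc = np.asarray(qc_intensities, dtype=float)
    mean = qc.mean(axis=0)
    sd = qc.std(axis=0, ddof=1)  # sample standard deviation
    return 100.0 * sd / mean


def median_rsd_change(qc_before, qc_after):
    """Median QC RSD before and after normalization (lower is better)."""
    return np.median(qc_rsd(qc_before)), np.median(qc_rsd(qc_after))


# Toy data (hypothetical): three QC injections that differ only by a
# per-injection scaling factor, and the same injections after the
# factor has been removed.
scale = np.array([[1.0], [1.2], [0.8]])
qc_raw = scale * np.array([100.0, 200.0])
qc_normalized = qc_raw / scale

before, after = median_rsd_change(qc_raw, qc_normalized)
```

In this toy case the median RSD falls from 20% to 0% because the only variation between injections was the scaling factor. Real QC data would retain some residual variation after normalization.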
Probabilistic Quotient Normalization (PQN) and Locally Estimated Scatterplot Smoothing (LOESS) QC were identified as optimal for metabolomics and lipidomics, while PQN, Median, and LOESS normalization excelled for proteomics. These methods reliably improved QC feature consistency in metabolomics and lipidomics, and preserved time-related or treatment-related variance in proteomics, demonstrating their effectiveness and robustness. SERRF normalization, applied only to metabolomics in this study, outperformed other methods in some datasets but inadvertently masked treatment-related variance in others.
Our evaluation identified PQN and LOESS QC as the top methods for metabolomics and lipidomics, and PQN, Median, and LOESS normalization as the top methods for proteomics, for multi-omics integration in a time-course study.
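For readers unfamiliar with two of the methods named above, the following is a minimal sketch of median normalization and PQN as they are commonly described, not the implementation used in the study. It assumes a samples-by-features matrix of strictly positive, non-log intensities, and it uses the feature-wise median across all samples as the PQN reference. Pooled QC samples are another common choice of reference. The function names and toy data are hypothetical.

```python
import numpy as np


def median_normalize(x):
    """Scale each sample so that all samples share the same median intensity."""
    x = np.asarray(x, dtype=float)
    sample_medians = np.nanmedian(x, axis=1, keepdims=True)
    # Rescale to the median of the sample medians to keep the original scale.
    return x / sample_medians * np.nanmedian(sample_medians)


def pqn_normalize(x, reference=None):
    """Probabilistic quotient normalization of a samples-by-features matrix.

    reference: optional 1-D reference profile (n_features,). If omitted,
    the feature-wise median across all samples is used (an assumption here).
    """
    x = np.asarray(x, dtype=float)
    if reference is None:
        reference = np.nanmedian(x, axis=0)
    quotients = x / reference  # feature-wise ratio to the reference
    # The median quotient of each sample estimates its dilution factor.
    factors = np.nanmedian(quotients, axis=1, keepdims=True)
    return x / factors


# Toy data (hypothetical): one profile measured at three dilutions.
profile = np.array([10.0, 20.0, 40.0, 80.0])
samples = np.vstack([profile, 2.0 * profile, 0.5 * profile])

pqn_result = pqn_normalize(samples)
median_result = median_normalize(samples)
```

In this toy case both methods recover the undiluted profile in every row, because the samples differ only by a global dilution factor. The methods diverge on real data: PQN takes the median of feature-wise quotients, so it is less sensitive to a subset of features that genuinely change between conditions.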