Lu Keyi, Liu Yaru, Cheng Kian-Kai, Guo Fanjing, Deng Lingli, Dong Jiyang
Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 361005, China.
Faculty of Chemical and Energy Engineering, Universiti Teknologi Malaysia, Johor Bahru, Johor, 81310, Malaysia.
Anal Chim Acta. 2025 Oct 22;1372:344440. doi: 10.1016/j.aca.2025.344440. Epub 2025 Jul 16.
Metabolomics studies often grapple with the dilution effect, in which overall sample concentration varies because of inconsistent handling or biological diversity, particularly in samples such as urine, saliva, or cell extracts. This variation can mask true metabolic differences and complicate data interpretation. Traditional normalization methods, such as Constant Sum Normalization (CSN), Probabilistic Quotient Normalization (PQN), and Maximal Density Fold Change (MDFC), assume that all samples share a certain invariant statistic. Because they overlook data heterogeneity, they can erase the very differences needed to distinguish biological subgroups.
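For orientation, the two simplest baselines named above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function names `csn` and `pqn` are hypothetical, the PQN variant follows the commonly described recipe (global median reference spectrum, median quotient as the dilution factor), and all intensities are assumed to be strictly positive. MDFC is omitted.

```python
import numpy as np

def csn(X, total=1.0):
    """Constant Sum Normalization: scale each sample (row) to sum to `total`."""
    X = np.asarray(X, dtype=float)
    return X * (total / X.sum(axis=1, keepdims=True))

def pqn(X):
    """Probabilistic Quotient Normalization with a global median reference.

    Assumes strictly positive intensities (no zeros in the reference).
    """
    X = np.asarray(X, dtype=float)
    reference = np.median(X, axis=0)           # one reference spectrum for all samples
    quotients = X / reference                  # per-feature fold change vs. reference
    factors = np.median(quotients, axis=1, keepdims=True)  # estimated dilution factor
    return X / factors

# Usage: one metabolite profile measured at three dilutions.
base = np.array([10.0, 4.0, 2.0, 1.0])
dilutions = np.array([0.5, 1.0, 2.0])
X = base * dilutions[:, None]
X_csn = csn(X)
X_pqn = pqn(X)
```

Both functions apply a single invariant (the total intensity, or one global reference spectrum) to every sample, which is the shared assumption the abstract criticizes.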
To address this, we introduce Local Neighbor Normalization (LNN), a novel approach that corrects for dilution effects while preserving the intrinsic variability of metabolomics data. LNN identifies a neighbor set for each sample based on similarity metrics and normalizes each sample against a tailored reference spectrum derived from these neighbors. In comprehensive evaluations on both simulated and real metabolomics datasets from NMR, GC-MS, and LC-MS platforms, LNN outperformed CSN, PQN, and MDFC. Specifically, it eliminated dilution effects more effectively and better recovered inter-sample heterogeneity and inter-metabolite correlations, as evidenced by metrics such as the D-statistic and correlation recovery rates. Notably, LNN excels in datasets with over 50% differential metabolites, safeguarding local data structures that are critical for downstream analyses such as biomarker discovery.
LNN constructs sample-specific reference spectra based on a local neighbor set. This approach ensures that normalization accounts for dilution effects without compromising the local structure of the data, which is crucial for biological interpretation. LNN also recovers inter-sample heterogeneity and metabolite correlations better than the compared methods, especially in datasets with high proportions of differential metabolites. The method's versatility, robustness against noise, and applicability across metabolomics platforms make it a significant advance in the field.
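The idea of a sample-specific reference can be illustrated with a minimal sketch. The abstract does not specify the similarity metric, the neighbor-set size, or how the reference spectrum is derived, so everything below is an assumption made for illustration: cosine similarity, a fixed `k` nearest neighbors, a rough constant-sum pre-scaling, a median reference over the neighbors, and a PQN-style median quotient. The function name `local_neighbor_normalize` is hypothetical, and this should not be read as the published LNN algorithm.

```python
import numpy as np

def local_neighbor_normalize(X, k=5):
    """Illustrative neighbor-based normalization (NOT the published LNN).

    Assumptions: strictly positive intensities, cosine similarity,
    k nearest neighbors, median reference, median-quotient dilution factor.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")

    # Assumed pre-scaling so neighbor spectra are on a comparable scale.
    X0 = X / X.sum(axis=1, keepdims=True)

    # Cosine similarity is scale-invariant, so dilution does not affect it.
    unit = X0 / np.linalg.norm(X0, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor

    out = np.empty_like(X0)
    factors = np.empty(n)
    for i in range(n):
        neighbors = np.argsort(sim[i])[-k:]            # k most similar samples
        reference = np.median(X0[neighbors], axis=0)   # sample-specific reference
        factors[i] = np.median(X0[i] / reference)      # estimated dilution factor
        out[i] = X0[i] / factors[i]
    return out, factors

# Usage: two subgroups with different profiles, each at four dilutions.
a = np.array([10.0, 1.0, 1.0, 5.0])
b = np.array([1.0, 10.0, 5.0, 1.0])
d = np.array([0.5, 1.0, 2.0, 4.0])
X = np.vstack([a * d[:, None], b * d[:, None]])
X_norm, factors = local_neighbor_normalize(X, k=3)
```

Because each reference is built only from a sample's nearest neighbors, the two subgroups in this toy example are never normalized against each other, which is the property the abstract attributes to working with a local rather than a global reference.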