Department of Epidemiology and Biostatistics, School of Public Health, Harbin Medical University, Harbin, 150086, China.
Laboratory of Hematology Center, First Affiliated Hospital of Harbin Medical University, Harbin, 150086, China.
Anal Chim Acta. 2019 Jul 11;1061:60-69. doi: 10.1016/j.aca.2019.02.010. Epub 2019 Feb 19.
Metabolomics provides new insights into disease pathogenesis and biomarker discovery. Samples from large-scale untargeted metabolomics studies are typically analyzed in several batches on a liquid chromatography-mass spectrometry platform. Batch effects, which are caused by non-biological systematic biases, are unavoidable in large-scale metabolomics studies, even when experiments are properly designed. Statistical analysis of large-scale metabolomics data without managing batch effects will yield misleading results. In this study, we propose a novel algorithm, called WaveICA, which is based on the wavelet transform with independent component analysis as the threshold processing method, to capture and remove batch effects in large-scale metabolomics data. The WaveICA method uses the time trend of samples over the injection order, decomposes the original data into multi-scale data with different features, extracts and removes the batch effect information in the multi-scale data, and obtains clean data. The WaveICA method was tested on real metabolomics data. After WaveICA was applied, the quality control samples (QCS) and the subject samples, which were scattered in the PCA score plot of the original data, each clustered closely. The average Pearson correlation coefficient across all peaks of the QCS increased from 0.872 to 0.972. Additionally, WaveICA significantly improved the classification accuracy for metabolomics data. The method was compared with three representative methods and outperformed all of them. In conclusion, WaveICA can efficiently remove batch effects while revealing more biological information. The method can be used in large-scale untargeted metabolomics studies to preprocess raw metabolomics data.
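The multi-scale decomposition step described in the abstract can be illustrated with a minimal sketch. This is not the WaveICA implementation: the abstract does not state which wavelet family or decomposition depth is used, so the Haar wavelet, the function names `haar_decompose` and `haar_reconstruct`, and the toy intensities below are all assumptions made for illustration. The sketch also zeroes a coefficient by hand, whereas WaveICA is described as using independent component analysis to identify the batch-related information at each scale.

```python
import math

def haar_decompose(signal, levels):
    """Multi-level Haar transform of one peak's intensities in injection order.

    Returns (approximation, details), where details[0] is the finest scale
    and details[-1] is the coarsest. Assumes len(signal) is divisible by
    2**levels (a simplification for this sketch).
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2 != 0:
            raise ValueError("signal length must be divisible by 2**levels")
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        # Differences capture change at this scale, sums carry the trend on.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose, rebuilding the signal from coarse to fine."""
    signal = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(signal, detail):
            rebuilt.append((s + d) / math.sqrt(2))
            rebuilt.append((s - d) / math.sqrt(2))
        signal = rebuilt
    return signal

# Hypothetical toy data: one peak over 8 injections, where the second
# batch (last 4 injections) is shifted upward by a constant offset.
intensities = [10.0, 10.0, 10.0, 10.0, 14.0, 14.0, 14.0, 14.0]

approx, details = haar_decompose(intensities, levels=3)

# The batch step appears only in the coarsest detail coefficient.
# Zeroing it by hand stands in for the ICA-based selection that the
# abstract describes; blind zeroing would also remove slow biological
# trends in real data.
cleaned_details = [list(d) for d in details]
cleaned_details[-1] = [0.0 for _ in cleaned_details[-1]]
corrected = haar_reconstruct(approx, cleaned_details)

print("coarsest detail:", details[-1])
print("corrected:", [round(v, 6) for v in corrected])
```

In this toy case the two finer scales carry no signal and the between-batch step is isolated at the coarsest scale, so removing that one coefficient levels the two batches to a common value. Real data would mix batch drift and biological variation within a scale, which is the situation the ICA step is meant to handle.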