Viallon Vivian, His Mathilde, Rinaldi Sabina, Breeur Marie, Gicquiau Audrey, Hemon Bertrand, Overvad Kim, Tjønneland Anne, Rostgaard-Hansen Agnetha Linn, Rothwell Joseph A, Lecuyer Lucie, Severi Gianluca, Kaaks Rudolf, Johnson Theron, Schulze Matthias B, Palli Domenico, Agnoli Claudia, Panico Salvatore, Tumino Rosario, Ricceri Fulvio, Verschuren W M Monique, Engelfriet Peter, Onland-Moret Charlotte, Vermeulen Roel, Nøst Therese Haugdahl, Urbarova Ilona, Zamora-Ros Raul, Rodriguez-Barranco Miguel, Amiano Pilar, Huerta José Maria, Ardanaz Eva, Melander Olle, Ottoson Filip, Vidman Linda, Rentoft Matilda, Schmidt Julie A, Travis Ruth C, Weiderpass Elisabete, Johansson Mattias, Dossus Laure, Jenab Mazda, Gunter Marc J, Lorenzo Bermejo Justo, Scherer Dominique, Salek Reza M, Keski-Rahkonen Pekka, Ferrari Pietro
Nutrition and Metabolism Branch, International Agency for Research on Cancer (IARC-WHO), 69008 Lyon, France.
Department of Public Health, Aarhus University, Bartholins Alle 2, DK-8000 Aarhus, Denmark.
Metabolites. 2021 Sep 17;11(9):631. doi: 10.3390/metabo11090631.
Pooling metabolomics data across studies is often desirable to increase the statistical power of the analysis. However, pooling raises methodological challenges, as several preanalytical and analytical factors can introduce differences in measured concentrations and variability between datasets. Specifically, different studies may use different sample types (e.g., serum versus plasma) that were collected, treated, and stored according to different protocols, and assayed in different laboratories using different instruments. To address these issues, a new pipeline was developed to normalize and pool metabolomics data through a set of sequential steps: (i) exclusion of the least informative observations and metabolites, removal of outliers, and imputation of missing data; (ii) identification of the main sources of variability through principal component partial R-square (PC-PR2) analysis; (iii) application of linear mixed models to remove unwanted variability, including that due to the samples' originating study and batch, while preserving biological variation and accounting for potential differences in the residual variances across studies. This pipeline was applied to targeted metabolomics data acquired using Biocrates AbsoluteIDQ kits in eight case-control studies nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Comprehensive examination of the metabolomics measurements indicated that the pipeline improved the comparability of data across the studies. The pipeline can be adapted to normalize other molecular data, including biomarker and proteomics data, and could be used for pooling molecular datasets, for example in international consortia, to limit biases introduced by inter-study variability. This versatility makes the work of potential interest to molecular epidemiologists.
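As an illustration of step (ii) only, the following is a minimal NumPy sketch of a PC-PR2-style computation. It is not the authors' implementation. The function name `pc_pr2`, the 80% variance threshold, and the definition of the partial R-square as the relative drop in residual sum of squares when one covariate is removed from the full linear model are all assumptions made for this example.

```python
import numpy as np

def pc_pr2(X, covariates, var_threshold=0.8):
    """Hypothetical PC-PR2-style summary (illustrative, not the published code).

    X          : (n_samples, n_metabolites) array of measurements.
    covariates : dict mapping a name to an (n_samples, k) numeric design block
                 (categorical factors such as study must be dummy-coded first).
    Returns a dict: name -> partial R-square averaged over the retained
    principal components, weighted by the variance each component explains.
    """
    Xc = X - X.mean(axis=0)                      # center each metabolite
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # variance share per component
    # Smallest number of components whose cumulative share reaches the threshold
    n_pc = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    n_pc = min(n_pc, len(s))
    scores = U[:, :n_pc] * s[:n_pc]              # principal component scores
    weights = explained[:n_pc] / explained[:n_pc].sum()

    def rss(design, y):
        # Residual sum of squares of an ordinary least-squares fit
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid)

    intercept = np.ones((X.shape[0], 1))
    names = list(covariates)
    full = np.hstack([intercept] + [covariates[k] for k in names])
    result = {}
    for name in names:
        reduced = np.hstack(
            [intercept] + [covariates[k] for k in names if k != name]
        )
        partial = []
        for j in range(n_pc):
            y = scores[:, j]
            rss_full, rss_red = rss(full, y), rss(reduced, y)
            # Share of the reduced model's residual variation explained by `name`
            partial.append((rss_red - rss_full) / rss_red if rss_red > 0 else 0.0)
        result[name] = float(np.dot(weights, partial))
    return result
```

Under these assumptions, a covariate such as study or batch with a large weighted partial R-square would be a candidate source of unwanted variability to handle in the mixed-model step (iii), whereas biological covariates would be kept as effects to preserve.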