Faquih Tariq, van Smeden Maarten, Luo Jiao, le Cessie Saskia, Kastenmüller Gabi, Krumsiek Jan, Noordam Raymond, van Heemst Diana, Rosendaal Frits R, van Hylckama Vlieg Astrid, Willems van Dijk Ko, Mook-Kanamori Dennis O
Department of Clinical Epidemiology, Leiden University Medical Center, Postal Zone C7-P, PO Box 9600, 2300 RC Leiden, The Netherlands.
Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, 8, 3584 Utrecht, The Netherlands.
Metabolites. 2020 Nov 26;10(12):486. doi: 10.3390/metabo10120486.
Metabolomics studies have seen steady growth due to the development and implementation of affordable and high-quality metabolomics platforms. In large metabolite panels, measurement values are frequently missing and, if neglected or sub-optimally imputed, can cause biased study results. We provided a publicly available, user-friendly script to streamline the imputation of missing endogenous, unannotated, and xenobiotic metabolites. We evaluated the multivariate imputation by chained equations (MICE) and k-nearest neighbors (kNN) analyses implemented in our script by simulations using measured metabolite data from the Netherlands Epidemiology of Obesity (NEO) study (n = 599). We simulated missing values in four unique metabolites from different pathways with different correlation structures in three sample sizes (599, 150, 50) with three missing percentages (15%, 30%, 60%), and using two missing mechanisms (completely at random and not at random). Based on the simulations, we found that for MICE, larger sample size was the primary factor decreasing bias and error. For kNN, the primary factor reducing bias and error was the metabolite correlation with its predictor metabolites. MICE provided consistently higher performance measures, particularly for larger datasets (n > 50). In conclusion, we presented an imputation workflow in a publicly available script to impute untargeted metabolomics data. Our simulations provided insight into the effects of sample size, percentage missing, and correlation structure on the accuracy of the two imputation methods.
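The simulation design described above (sample size, missing percentage, and missing mechanism as the varied factors, with bias and error as outcomes) can be illustrated with a minimal sketch. This is not the authors' script: the data are synthetic rather than NEO measurements, a single predictor metabolite stands in for the full correlation structure, "not at random" is assumed here to mean that the lowest values are missing (as with a detection limit), the kNN step is a simplified mean of neighbors, and MICE is omitted. All function names are hypothetical.

```python
import math
import random


def simulate_data(n, rho, rng):
    """Synthetic predictor/target pair with correlation of about rho (assumption)."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    return x, y


def mask_mcar(y, frac, rng):
    """Missing completely at random: pick indices uniformly."""
    return set(rng.sample(range(len(y)), round(frac * len(y))))


def mask_mnar(y, frac):
    """Missing not at random (assumed mechanism): the lowest values go missing."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    return set(order[:round(frac * len(y))])


def knn_impute(x, y, missing, k=5):
    """Impute each missing y as the mean y of the k observed samples nearest in x."""
    observed = [i for i in range(len(y)) if i not in missing]
    imputed = {}
    for i in missing:
        nearest = sorted(observed, key=lambda j: abs(x[j] - x[i]))[:k]
        imputed[i] = sum(y[j] for j in nearest) / len(nearest)
    return imputed


def bias_and_rmse(y, imputed):
    """Mean error (bias) and root mean squared error over the imputed values."""
    errors = [imputed[i] - y[i] for i in imputed]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


if __name__ == "__main__":
    rng = random.Random(2020)
    # Same grid as in the abstract: three sample sizes, three missing percentages.
    for n in (599, 150, 50):
        x, y = simulate_data(n, rho=0.8, rng=rng)
        for frac in (0.15, 0.30, 0.60):
            masks = {"MCAR": mask_mcar(y, frac, rng), "MNAR": mask_mnar(y, frac)}
            for name, mask in masks.items():
                bias, rmse = bias_and_rmse(y, knn_impute(x, y, mask))
                print(f"n={n} missing={frac:.0%} {name}: bias={bias:.3f} rmse={rmse:.3f}")
```

Under this assumed not-at-random mechanism, every imputed value is an average of observed values that all exceed the missing ones, so the bias is necessarily positive. Changing `rho` shows how the kNN error in this toy setup depends on the correlation between the target and its predictor.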