Wilson Machelle D, Ponzini Matthew D, Taylor Sandra L, Kim Kyoungmi
Department of Public Health Sciences, University of California, Davis, Sacramento, CA 95817, USA.
Department of Public Health Sciences, University of California, Davis, Davis, CA 95616, USA.
Metabolites. 2022 Jul 21;12(7):671. doi: 10.3390/metabo12070671.
The analysis of high-throughput metabolomics mass spectrometry data across multiple biological sample types (biospecimens) poses challenges due to missing data. During differential abundance analysis, dropping samples with missing values can lead to severe loss of data as well as biased results in group comparisons and effect size estimates. However, imputing missing data (replacing missing values with estimates such as a mean) may distort the inherent intra-subject correlation of a metabolite across multiple biospecimens from the same subject, which in turn may compromise the statistical analysis of differential metabolites in biomarker discovery. We investigated imputation strategies for multiple biospecimens from the same subject. We compared a novel but simple approach, which combines the two biospecimen data matrices (with rows and columns corresponding to subjects and metabolites) and imputes them together, with an approach that imputes each biospecimen data matrix separately. We then compared the two approaches with respect to bias in the estimated intra-subject multi-specimen correlation and its effect on the validity of statistical significance tests. The combined approach has not previously been evaluated for multi-biospecimen studies, even though it is intuitive and easy to implement. We examined both approaches for five imputation methods: random forest, k-nearest neighbor, expectation-maximization with bootstrap, quantile regression, and half the minimum observed value. Combining the biospecimen data matrices for imputation did not greatly improve preservation of the correlation structure or the accuracy of the statistical conclusions for most of the methods examined. Random forest tended to outperform the other methods on all performance metrics except specificity.
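The contrast between the two strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes subjects are in rows and metabolites in columns, that "combining" means placing the two biospecimen matrices side by side so each subject's row holds both sets of measurements, and it substitutes a simplified nearest-neighbor imputer written only for this illustration. The function names (`knn_impute`, `impute_separately`, `impute_combined`) and the toy data are hypothetical.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest subjects.

    Simplified illustration: distance is the root mean squared difference
    over features observed in both subjects.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.where(missing)):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or missing[r, j]:
                continue  # a donor must have feature j observed
            shared = ~missing[i] & ~missing[r]
            if not shared.any():
                continue  # no common features to measure distance on
            dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda t: t[0])
            filled[i, j] = np.mean([v for _, v in candidates[:k]])
        else:
            filled[i, j] = np.nanmean(X[:, j])  # fallback: column mean
    return filled

def impute_separately(A, B, k=2):
    """Impute each biospecimen matrix on its own."""
    return knn_impute(A, k), knn_impute(B, k)

def impute_combined(A, B, k=2):
    """Impute both matrices together (assumed: column-wise concatenation)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    filled = knn_impute(np.hstack([A, B]), k)
    p = A.shape[1]
    return filled[:, :p], filled[:, p:]

# Toy data (synthetic, for illustration only): 30 subjects, 4 metabolites,
# two correlated biospecimens with about 15% of values set to missing.
rng = np.random.default_rng(0)
spec_a = rng.normal(size=(30, 4))
spec_b = 0.8 * spec_a + 0.6 * rng.normal(size=(30, 4))
obs_a = np.where(rng.random(spec_a.shape) < 0.15, np.nan, spec_a)
obs_b = np.where(rng.random(spec_b.shape) < 0.15, np.nan, spec_b)

def per_metabolite_corr(a, b):
    """Intra-subject correlation between biospecimens, one value per metabolite."""
    return np.array([np.corrcoef(a[:, j], b[:, j])[0, 1] for j in range(a.shape[1])])

true_corr = per_metabolite_corr(spec_a, spec_b)
sep_bias = per_metabolite_corr(*impute_separately(obs_a, obs_b)) - true_corr
comb_bias = per_metabolite_corr(*impute_combined(obs_a, obs_b)) - true_corr
print("bias, separate:", np.round(sep_bias, 3))
print("bias, combined:", np.round(comb_bias, 3))
```

Under this column-wise assumption, a strictly per-column method such as half the minimum observed value would give identical results for the two strategies, whereas methods that borrow information across columns or subjects can differ.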