Centre of Excellence for Nutrition, North-West University, Potchefstroom, South Africa.
Centre of Excellence for Nutrition, North-West University, Potchefstroom, South Africa; Laboratory of Human Nutrition, Institute of Food, Nutrition and Health, ETH, Zurich, Switzerland.
Nutr Res. 2020 Mar;75:67-76. doi: 10.1016/j.nutres.2020.01.001. Epub 2020 Jan 9.
Principal component analysis (PCA) is a popular statistical tool. However, although imputing missing data before PCA is good practice with numerous advantages, it is not commonly done. In the present work, we evaluated the hypothesis that the expectation-maximization (EM) algorithm for missing data imputation is a reliable and advantageous procedure when using PCA to derive biomarker profiles and dietary patterns. To this end, we used numerical simulations designed to mimic real data commonly observed in nutritional research. We then showed the advantages and pitfalls of the EM algorithm for missing data imputation applied to plasma fatty acid concentrations and nutrient intakes from real data sets derived from the US National Health and Nutrition Examination Survey (NHANES). PCA applied to simulated data with missing values yielded eigenvalues that were biased relative to those from the original data set without missing values. This bias increased with the number of missing values and appeared to be independent of the correlation structure among variables. In contrast, when data were imputed, the mean of the eigenvalues over the 10 imputation runs overlapped with those derived from PCA applied to the original data set. These results were confirmed when real data sets from NHANES were analyzed. We accept the hypothesis that EM imputation of missing data before PCA is an effective technique for deriving biochemical profiles and dietary patterns, especially for relatively small sample sizes.
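The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multivariate normal model, values missing completely at random, a single deterministic EM fit (the abstract's 10 imputation runs are not reproduced), and listwise deletion as the "no imputation" baseline. The names `em_impute` and `corr_eigenvalues`, the simulated loadings, and the 10% missingness rate are all hypothetical choices made for illustration.

```python
import numpy as np

def em_impute(X, n_iter=100, tol=1e-6):
    """EM under an assumed multivariate normal model.

    Returns (filled data, estimated mean, estimated covariance).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    filled = np.where(missing, mu, X)          # start from column-mean fill
    sigma = np.cov(filled, rowvar=False, bias=True) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        cond_cov_sum = np.zeros((p, p))
        # E-step: conditional mean (and covariance) of missing given observed
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                    # fully missing row
                filled[i] = mu
                cond_cov_sum += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            coef = np.linalg.solve(s_oo, s_mo.T).T   # s_mo @ inv(s_oo)
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            cond_cov_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: update mean and covariance from the completed data
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + cond_cov_sum) / n
        delta = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if delta < tol:
            break
    return filled, mu, sigma

def corr_eigenvalues(cov):
    """Eigenvalues of the correlation matrix implied by cov, largest first."""
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    return np.linalg.eigvalsh(corr)[::-1]

# Simulated data: six variables driven by two latent factors (illustrative only)
rng = np.random.default_rng(0)
n, p = 200, 6
loadings = rng.normal(size=(2, p))
full = rng.normal(size=(n, 2)) @ loadings + 0.5 * rng.normal(size=(n, p))

# Remove about 10% of the entries completely at random
with_gaps = full.copy()
with_gaps[rng.random(full.shape) < 0.10] = np.nan

complete_rows = with_gaps[~np.isnan(with_gaps).any(axis=1)]
_, _, sigma_em = em_impute(with_gaps)

print("original data:    ", corr_eigenvalues(np.cov(full, rowvar=False)).round(3))
print("listwise deletion:", corr_eigenvalues(np.cov(complete_rows, rowvar=False)).round(3))
print("EM estimate:      ", corr_eigenvalues(sigma_em).round(3))
```

The eigenvalues here come from the EM covariance estimate rather than from the covariance of the filled matrix, because filling with conditional means alone understates the variance of the imputed entries. Running PCA on the filled matrix directly is a different design choice and would need a stochastic or multiple-imputation step to avoid that shrinkage.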