Burcham Sara, Liu Yuki, Merianos Ashley L, Mendy Angelico
Division of Epidemiology, Department of Environmental and Public Health Sciences, University of Cincinnati College of Medicine, Cincinnati, OH, USA.
Intuitive Surgical, Inc., Global Health Economics and Outcomes Research, Sunnyvale, CA, USA.
Epidemiol Methods. 2023 Nov 10;12(1):20230018. doi: 10.1515/em-2023-0018. eCollection 2023 Jan.
Outlier detection and removal is an important step in preparing data for statistical analysis, yet no gold standard exists in the current literature. The objective of this study is to identify the ideal decision test using the National Health and Nutrition Examination Survey (NHANES) 2017-2018 dietary data.
We conducted a secondary analysis of NHANES 24-h dietary recalls, accounting for the survey's multi-stage cluster design. Six outlier detection and removal strategies were assessed by evaluating each decision test's impact on Pearson's correlation coefficient among macronutrients. Furthermore, we assessed changes in the effect size estimates based on pre-defined sample sizes. The data were collected as part of the 2017-2018 24-h dietary recall among adult participants (N=4,893).
Across all decision tests, effect estimate changes for macronutrients ranged from 6.5% for protein to 39.3% for alcohol. The largest proportion of outliers removed was 4.0%, observed in the large sample size under the decision test that flags values >2 standard deviations from the mean. Compared with no decision test, the smallest sample size was most affected by the six decision tests, particularly for the alcohol analysis.
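As an illustration of the kind of comparison reported above, the following minimal sketch applies a ">2 standard deviations from the mean" rule and measures how Pearson's correlation coefficient changes afterwards. It is not the authors' code. The function names, the simulated intake values, the pairwise removal of observations, and the definition of percentage change are all assumptions made for the example, and the sketch ignores NHANES survey weights and the cluster design.

```python
import math
import random
from statistics import mean, stdev


def pearson_r(x, y):
    """Unweighted Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def within_k_sd(values, k=2.0):
    """True for each value lying within k sample SDs of the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) <= k * s for v in values]


def evaluate_rule(x, y, k=2.0):
    """Drop pairs where either variable is flagged, then compare r.

    Pairwise removal and the relative-change formula are assumptions;
    the abstract does not specify either.
    """
    keep = [a and b for a, b in zip(within_k_sd(x, k), within_k_sd(y, k))]
    x_kept = [v for v, flag in zip(x, keep) if flag]
    y_kept = [v for v, flag in zip(y, keep) if flag]
    r_before = pearson_r(x, y)
    r_after = pearson_r(x_kept, y_kept)
    return {
        "pct_removed": 100 * (1 - len(x_kept) / len(x)),
        "r_before": r_before,
        "r_after": r_after,
        "pct_change_r": 100 * abs(r_after - r_before) / abs(r_before),
    }


# Simulated, right-skewed intakes (grams); these are NOT NHANES values.
random.seed(0)
protein = [random.lognormvariate(4.3, 0.5) for _ in range(500)]
fat = [0.8 * p + random.gauss(0, 15) for p in protein]

result = evaluate_rule(protein, fat)
print(result)
```

Running the same function on subsamples of different sizes would give a rough analogue of the sample-size comparison described in the abstract.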
This study, the first to use 2017-2018 NHANES dietary data for outlier evaluation, emphasizes the importance of selecting an appropriate decision test considering factors such as statistical power, sample size, normality assumptions, the proportion of data removed, effect estimate changes, and the consistency of estimates across sample sizes. We recommend the use of non-parametric tests for non-normally distributed variables of interest.
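The abstract does not name the non-parametric tests it recommends. As one commonly used example of a rule that makes no normality assumption, the sketch below flags values outside Tukey's fences (1.5 times the interquartile range beyond the quartiles). The choice of this rule, the function names, and the intake values are assumptions for illustration only and are not taken from the study.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences based on quartiles and the IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers_iqr(values, k=1.5):
    """True for each value that falls outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v < lower or v > upper for v in values]


# Hypothetical, zero-inflated alcohol intakes (grams); NOT NHANES data.
alcohol_g = [0, 0, 0, 0, 0, 0, 5, 10, 14, 14, 20, 28, 28, 42, 210]

flags = flag_outliers_iqr(alcohol_g)
flagged = [v for v, is_outlier in zip(alcohol_g, flags) if is_outlier]
print(tukey_fences(alcohol_g), flagged)
```

Because the fences depend on quartiles rather than on the mean and standard deviation, a single extreme value does not widen them, which is one reason rules of this kind are often preferred for skewed variables.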