Batterham Marijka, Neale Elizabeth, Martin Allison, Tapsell Linda
Statistical Consulting Centre, National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong, New South Wales, Australia.
School of Medicine, Faculty of Science, Medicine and Health, University of Wollongong, Wollongong, New South Wales, Australia.
Nutr Diet. 2017 Feb;74(1):3-10. doi: 10.1111/1747-0080.12337.
Data mining enables further insights from nutrition-related research, but caution is required. The aim of this analysis was to demonstrate and compare the utility of data mining methods in classifying a categorical outcome derived from a nutrition-related intervention.
Baseline data (23 variables, 8 categorical) on participants (n = 295) in an intervention trial were used to classify participants according to whether they met the criterion of 10 000 steps per day. Results from classification and regression trees (CARTs), random forests, adaptive boosting, logistic regression, support vector machines and neural networks were compared using area under the curve (AUC) and error assessments.
The CART produced the best model when considering the AUC (0.703), overall error (18%) and within-class error (28%). Logistic regression also performed reasonably well compared to the other models (AUC 0.675, overall error 23%, within-class error 36%). All the methods gave different rankings of variable importance. CART found that body fat, quality of life measured by the SF-12 Physical Component Summary (PCS) and the cholesterol:HDL ratio were the most important predictors of meeting the 10 000 steps criterion, while logistic regression showed the SF-12 PCS, glucose levels and level of education to be the most significant predictors (P ≤ 0.01).
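As an illustration of the three comparison metrics reported above, the following is a minimal sketch of how AUC, overall error and per-class error can be computed for a binary classifier. It is not the authors' code: the function names `auc` and `error_rates`, the toy labels and scores, and the 0.5 threshold are all assumptions made for this example, and the abstract does not state which class its "within-class error" refers to, so the sketch simply reports the error for each class.

```python
from typing import Dict, Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def error_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, object]:
    """Overall misclassification rate plus the misclassification rate
    within each observed class."""
    overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for c in sorted(set(y_true)):
        members = [i for i, t in enumerate(y_true) if t == c]
        per_class[c] = sum(y_pred[i] != c for i in members) / len(members)
    return {"overall": overall, "per_class": per_class}


# Hypothetical toy data: 1 = met the step criterion, 0 = did not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

print(auc(y_true, scores))
print(error_rates(y_true, y_pred))
```

Applying the same two functions to the predictions of each fitted model would give a like-for-like comparison table. A model can have a low overall error and still misclassify a large share of the smaller class, which is why the per-class figure is worth reporting alongside the overall one.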
The differing outcomes suggest that caution is required when relying on a single data mining method, particularly in a dataset with nonlinear relationships and outliers, and when exploring relationships that were not the primary outcomes of the research.