Partington Susan N, Papakroni Vasil, Menzies Tim
Division of Animal and Nutritional Sciences, West Virginia University, Morgantown, WV, USA.
BMC Public Health. 2014 Jun 12;14:593. doi: 10.1186/1471-2458-14-593.
Collecting data can be cumbersome and expensive. A lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test whether the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset while (b) not damaging the signal inside that data.
The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test.
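The pipeline described above (select items, fit a linear regression, compute a reduced score, compare with the full score) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the item point values are hypothetical, and a simple correlation ranking stands in for the feature selection method, which the abstract does not name. The sketch also adds the fitted intercept as the constant term, which is one reading of "summed with the error term". The rank-sum p-value uses a normal approximation without a tie correction.

```python
import math
import random
from statistics import NormalDist, mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(A, b)

def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    w, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        w += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n, m = len(a), len(b)
    mu = n * (n + m + 1) / 2
    sigma = math.sqrt(n * m * (n + m + 1) / 12)
    return 2 * (1 - NormalDist().cdf(abs((w - mu) / sigma)))

# Synthetic stand-in for survey data: 200 outlets, 12 yes/no items.
# The point values are hypothetical; a few items dominate the full score.
random.seed(0)
points = [8, 6, 5, 4] + [1] * 8
items = [[random.randint(0, 1) for _ in points] for _ in range(200)]
full = [sum(p * v for p, v in zip(points, row)) for row in items]

# Assumed stand-in for feature selection: keep the items most correlated
# with the full score.
keep = 4
cols = list(zip(*items))
ranked = sorted(range(len(points)), key=lambda j: -abs(pearson(cols[j], full)))
selected = ranked[:keep]

# Weight the selected items by regression coefficients and add the intercept.
X = [[float(row[j]) for j in selected] for row in items]
coef = fit_ols(X, full)
reduced = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]

# R^2 of the reduced model, then a rank-sum comparison of the two score sets.
full_mean = mean(full)
ss_res = sum((f - r) ** 2 for f, r in zip(full, reduced))
ss_tot = sum((f - full_mean) ** 2 for f in full)
r2 = 1 - ss_res / ss_tot
p_value = rank_sum_p(full, reduced)
print(f"selected items: {sorted(selected)}, R^2 = {r2:.3f}, p = {p_value:.3f}")
```

In this setup, a large p-value would indicate no detectable difference between the distributions of full and reduced scores, which is the sense in which the reduced survey preserves the signal.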
Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R² values of 92% and 94% for restaurant and grocery store data, respectively.
While there are many potentially important variables in any domain, the most useful set may be only a small subset. Using feature selection in the initial phase of data collection to identify the most influential variables may greatly reduce the amount of data needed, thereby reducing cost.