1 Department of Public Health, University of Helsinki, Finland.
2 Department of Public Health Solutions, National Institute for Health and Welfare, Finland.
Scand J Public Health. 2018 Jul;46(5):557-564. doi: 10.1177/1403494817736944. Epub 2017 Oct 30.
Factors that contribute to the development of overweight are numerous and form a complex structure with many unknown interactions and associations. We aimed to explore this structure (i.e. the mutual importance or hierarchy of sociodemographic and lifestyle-related risk factors of being overweight) using a machine-learning technique called random forest (RF). The results were compared with traditional logistic regression (LR) analysis.
The cross-sectional FINRISK 2007 Study included 4757 Finns (aged 25-74 years). Information on participants' lifestyle and sociodemographic characteristics was collected with questionnaires. Diet was assessed using a validated food-frequency questionnaire. Height and weight were measured. Participants with a body mass index (BMI) ≥25 kg/m² were classified as overweight. R statistical software was used to run the RF analysis ('randomForest') to derive estimates for variable importance and out-of-bag error, which were compared to an LR model.
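The outcome definition above can be illustrated with a minimal Python sketch. It assumes weight in kilograms and height in metres; the function names and the two example participants are hypothetical and are not taken from the study data.

```python
OVERWEIGHT_CUTOFF = 25.0  # kg/m^2, the threshold stated in the abstract


def bmi(weight_kg: float, height_m: float) -> float:
    """Return body mass index in kg/m^2 (weight divided by height squared)."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_overweight(weight_kg: float, height_m: float,
                  cutoff: float = OVERWEIGHT_CUTOFF) -> bool:
    """Classify a participant as overweight when BMI is at or above the cutoff."""
    return bmi(weight_kg, height_m) >= cutoff


# Hypothetical (weight_kg, height_m) pairs, for illustration only.
participants = [(70.0, 1.75), (85.0, 1.70)]
labels = [is_overweight(w, h) for w, h in participants]
```

The resulting binary label is the kind of outcome that both a random forest and a logistic regression model would be fitted to.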
In total, 704 (32%) men and 1119 (44%) women had normal BMI, whereas 1502 (68%) men and 1432 (56%) women had BMI ≥25. Estimated error rates for the models were similar (RF vs. LR: 42% vs. 40% for men, 38% vs. 35% for women). Both models ranked age, education and physical activity as the most important risk factors for being overweight, but RF ranked macronutrients (carbohydrates and protein) as more important than LR did.
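As a quick arithmetic check, the group shares can be recomputed from the reported counts. This is a sketch only: the counts come from the abstract, while the dictionary layout and helper name are assumptions made for illustration.

```python
# Counts as reported in the abstract, split by sex and BMI category.
counts = {
    "men": {"normal": 704, "overweight": 1502},
    "women": {"normal": 1119, "overweight": 1432},
}


def shares(group: dict) -> dict:
    """Return each category's share of the group, as a whole-number percentage."""
    total = sum(group.values())
    return {category: round(100 * n / total) for category, n in group.items()}


# Overall sample size implied by the four counts.
total_n = sum(sum(group.values()) for group in counts.values())

# Within-sex percentages for each BMI category.
pct = {sex: shares(group) for sex, group in counts.items()}

for sex, values in pct.items():
    print(sex, values)
print("total participants:", total_n)
```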
RF did not demonstrate higher power in variable selection compared to LR in our study. The features of RF are more likely to appear beneficial in settings with a larger number of predictors.