Sewpaul Ronel, Awe Olushina Olawale, Dogbey Dennis Makafui, Sekgala Machoene Derrick, Dukhi Natisha
Public Health, Societies and Belonging, Human Sciences Research Council, Merchant House, 2 Dock Rail Road, Cape Town 8001, South Africa.
Institute of Mathematics, Statistics and Scientific Computing (IMECC), University of Campinas, Campinas 13083-859, Brazil.
Int J Environ Res Public Health. 2023 Dec 19;21(1):2. doi: 10.3390/ijerph21010002.
This study evaluates the performance of logistic regression (LR) and random forest (RF) algorithms to model obesity among female adolescents in South Africa.
Data were analysed on 375 females aged 15-17 from the South African National Health and Nutrition Examination Survey 2011/2012. The primary outcome was obesity, defined as body mass index (BMI) ≥ 30 kg/m². A total of 31 explanatory variables were included, covering socio-economic, demographic, family history, dietary, and health behaviour factors. RF and LR models were run on the imbalanced data as well as after oversampling, undersampling, and hybrid sampling of the data.
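The three balancing strategies named above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say which resampling algorithm was used, so plain random resampling is assumed, the helper name `resample` is hypothetical, and the 30/345 class split is invented for the example (only the total of 375 comes from the study).

```python
import random
from collections import Counter


def resample(labels, strategy, seed=0):
    """Return row indices after random resampling of a two-class label list.

    strategy: "over"   duplicates minority rows up to the majority count
              "under"  drops majority rows down to the minority count
              "hybrid" moves both classes to the midpoint of the two counts
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Sort the two index lists by size: [minority, majority]
    min_idx, maj_idx = sorted(by_class.values(), key=len)

    if strategy == "over":
        target = len(maj_idx)
    elif strategy == "under":
        target = len(min_idx)
    elif strategy == "hybrid":
        target = (len(min_idx) + len(maj_idx)) // 2
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Minority: keep every original row, add random duplicates if needed
    extra = target - len(min_idx)
    new_min = min_idx + [rng.choice(min_idx) for _ in range(extra)]
    # Majority: sample without replacement down to the target size
    new_maj = rng.sample(maj_idx, target)
    return new_min + new_maj


# Hypothetical class split: 30 obese (1) vs 345 non-obese (0), 375 in total
labels = [1] * 30 + [0] * 345
for strategy in ("over", "under", "hybrid"):
    idx = resample(labels, strategy)
    counts = Counter(labels[i] for i in idx)
    print(strategy, sorted(counts.items()))
```

In practice resampling is usually applied to the training split only, so that the evaluation data keep the original class distribution.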
On the imbalanced data, the RF model outperformed LR, with higher precision, recall, F1 score, and balanced accuracy. Balanced accuracy was highest with the hybrid data (0.618 for RF and 0.668 for LR). On the hybrid-balanced data, the RF model achieved the higher F1 score (0.940 for RF vs. 0.798 for LR).
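For reference, the four reported metrics follow from the cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and balanced accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Balanced accuracy is the mean of sensitivity and specificity
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts for a classifier that finds most positive cases
print(binary_metrics(tp=8, fp=2, fn=2, tn=88))

# A classifier that always predicts the majority class: plain accuracy
# would be 0.90, yet balanced accuracy is only 0.5 and F1 is 0
print(binary_metrics(tp=0, fp=0, fn=10, tn=90))
```

Because balanced accuracy weights both classes equally while F1 ignores true negatives, the two metrics can rank a pair of models differently.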
The RF model had the highest overall performance metrics, both before balancing the data and after applying hybrid balancing. Future work would benefit from larger datasets on adolescent female obesity to assess the robustness of the models.