Zhang Zikang, Peng Wei, Sun Shaoming, Ma Jianguo, Sun Yining, Zhang Fangwen
Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, 230031, PR China.
University of Science and Technology of China, Hefei, 230026, PR China.
Endocrine. 2024 Nov;86(2):600-611. doi: 10.1007/s12020-024-03902-4. Epub 2024 Jun 10.
This study aimed to develop and evaluate machine-learning models for predicting the onset of overweight in adolescents aged 14‒17, utilizing easily collectible personal information.
This was a one-year prospective cohort study. Baseline data were collected through anthropometric measurements and questionnaires, and the incidence of overweight was determined one year later from repeat anthropometric measurements. Predictive factors were selected through univariate analysis. Six machine-learning models were developed for predicting the onset of overweight. SHapley Additive exPlanations (SHAP) were used for global and local interpretation of the models.
Of 1,241 adolescents, 204 (16.4%) were identified as overweight after one year. Nineteen features were associated with the incidence of overweight in univariate analysis. Participants were randomly divided into a training group and a testing group in a 7:3 ratio. The Light Gradient Boosting Machine (LGBM) algorithm outperformed the other models, achieving the following metrics: Accuracy (0.956), Recall (0.812), Specificity (0.983), F1-score (0.855), AUC (0.961). Importance ranking revealed that a minimal set of the top 11 features maintained stable model performance.
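For readers less familiar with the reported metrics, the following is a minimal sketch of how Accuracy, Recall, Specificity, and F1-score are derived from a binary confusion matrix. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the study; AUC is omitted because it requires predicted probabilities rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common binary-classification metrics from confusion-matrix counts.

    tp/fn: overweight cases predicted correctly / missed.
    tn/fp: non-overweight cases predicted correctly / falsely flagged.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
example = classification_metrics(tp=40, fp=20, tn=180, fn=10)
```

With an imbalanced outcome such as the 16.4% incidence reported here, Recall and Specificity are worth reading alongside Accuracy, since a model that predicts "not overweight" for everyone would still score high on Accuracy alone.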
The onset of overweight in adolescents was accurately predicted using easily collectible personal information. The LGBM-based model exhibited superior performance. The use of an oversampling technique notably improved model performance. The model interpretation technique provided innovative strategies for managing adolescent overweight/obesity.
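The abstract does not state which oversampling technique was used. As an illustration only, the sketch below assumes plain random oversampling, which duplicates minority-class rows until both classes are the same size. The function name `random_oversample`, its parameters, and the toy data are hypothetical. Oversampling is normally applied to the training split only, so that the testing group keeps the original class balance.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Balance a binary dataset by resampling minority rows with replacement.

    This is an assumed, generic technique, not necessarily the one used
    in the study.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Nothing to do if the minority class is empty or already large enough.
    if not minority or len(minority) >= len(majority):
        return list(rows), list(labels)

    # Draw extra minority indices until the two classes have equal counts.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(labels))) + extra
    rng.shuffle(indices)
    return [rows[i] for i in indices], [labels[i] for i in indices]


# Toy training data: 2 overweight (1) and 8 non-overweight (0) participants.
X_train = [[i] for i in range(10)]
y_train = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X_train, y_train)
```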