Melaku Mequannent Sharew, Baykemagn Nebebe Demis, Yohannes Lamrot, Zegeye Adem Tsegaw
Department of Health Informatics, Institute of Public Health, University of Gondar, Gondar, Ethiopia.
Department of Environmental and Occupational Health and Safety, Institute of Public Health, College of Medicine and Health Science, University of Gondar, Gondar, Ethiopia.
Sci Rep. 2025 Jul 9;15(1):24646. doi: 10.1038/s41598-025-09380-6.
Tobacco smoking is a significant public health issue in sub-Saharan Africa, with its prevalence shaped by various demographic factors. This study aimed to model predictors of tobacco use among men in sub-Saharan Africa between 2018 and 2023 using machine learning algorithms. Data from Demographic and Health Surveys covering 147,466 men were analyzed. STATA version 17 was used for data cleaning and descriptive statistics, while Python 3.9 was used for machine learning predictions. Several machine learning models, including Decision Tree, Logistic Regression, Random Forest, k-nearest neighbors (KNN), eXtreme Gradient Boosting (XGBoost), and AdaBoost, were used to identify the key predictors of tobacco use among men. Hyperparameters were optimized using Randomized Search with tenfold cross-validation to improve model performance. The SHapley Additive exPlanations (SHAP) method was used to assess predictor importance. Model performance was evaluated based on accuracy, precision, recall, F1 score, and area under the curve (AUC). The study found a pooled tobacco use prevalence of 14.73%, with no significant variation between countries. High tobacco use was observed in Mozambique, Zambia, Benin, Mali, Mauritania, Senegal, Guinea, Sierra Leone, and Liberia, with Tanzania, Benin, and Senegal reporting the highest rates. The XGBoost algorithm attained an accuracy of 98% and an AUC of 97%. SHAP analysis revealed that age, education, wealth index, religion, residence, internet use, occupation, age at first sex, number of sexual partners, and marital status were key predictors. These findings underscore the need for targeted public health interventions and highlight the value of machine learning in identifying at-risk populations and addressing socio-cultural and economic factors influencing tobacco use.
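The tuning and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the survey data, tunes a Random Forest (one of the listed models) rather than XGBoost, and the search grid is hypothetical. It shows Randomized Search with tenfold cross-validation followed by the five reported metrics on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumption: synthetic, imbalanced binary data standing in for the survey
# records (roughly 15% positives, loosely echoing the pooled prevalence).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hypothetical search space; the abstract does not report the actual grid.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}

# Randomized Search with tenfold cross-validation, scored by AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=10,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out split.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "auc": roc_auc_score(y_test, y_prob),
}
print(search.best_params_)
print(metrics)
```

With class imbalance of this kind, accuracy alone can look high even when few positives are detected, which is one reason to read precision, recall, F1, and AUC together.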