Arage Fetlework Gubena, Tadese Zinabu Bekele, Taye Eliyas Addisu, Tsegaw Tigist Kifle, Abate Tsegasilassie Gebremariam, Alemu Eyob Akalewold
Department of Public Health, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia.
Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia.
BMC Med Inform Decis Mak. 2025 May 26;25(1):197. doi: 10.1186/s12911-025-03039-y.
Cervical cancer, which includes squamous cell carcinoma and adenocarcinoma, is a leading cause of cancer-related deaths globally, particularly in low- and middle-income countries (LMICs). It is preventable through early screening, but incidence and mortality rates are significantly higher in LMICs, with 94% of deaths occurring in these regions. Poor implementation of screening programs, together with multiple health system barriers, leads to a high cervical cancer burden in these countries. Projections show cases and deaths from the disease increasing by 2030. Unlike conventional statistical methods, machine learning can capture complex, non-linear relationships among factors when predicting the outcome variable.
Secondary data from the Demographic and Health Surveys (DHS) of ten Sub-Saharan African countries were used to evaluate cervical cancer screening uptake among women aged 25-49 years. During cleaning, missing values and outliers were removed. Classes were balanced using the Synthetic Minority Oversampling Technique (SMOTE), and model hyperparameters were tuned via grid search, before the data were split into training and validation sets containing 80% and 20%, respectively. Seven machine learning classification algorithms were used to predict cervical cancer screening uptake: Logistic Regression, Decision Tree Classifier, Random Forest, K-Nearest Neighbor, Gradient Boosting, AdaBoost, and Extra Trees. Model performance was evaluated using accuracy, precision, recall, and F1 score.
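The class-balancing step can be illustrated with a minimal, pure-Python sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are hypothetical and chosen only for illustration; this is not the study's implementation, and it assumes numeric features and at least two minority samples.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority points.

    Illustrative only: `minority` is a list of numeric feature lists with
    at least two entries; `k` is the number of nearest neighbours considered.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from `base` to `neighbour`.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical values, not study data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation between two real minority samples, it always falls inside the region already spanned by the minority class rather than being an exact copy of an existing record.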
In this study, cervical cancer screening uptake was predicted among 75,360 weighted samples of women aged 25-49 from ten Sub-Saharan African countries, with a final dataset of 53,461 used for model building. The Extra Trees Classifier obtained an accuracy of 94.13%, a precision of 95.76%, a recall of 94.12%, and an F1-score of 93.80%, followed by Random Forest with an accuracy of 93.87% and a precision of 99.18%. Health visits, proximity to health care, contraceptive use, urban residence, and media exposure were the most important predictors. The ensemble methods, Extra Trees and Random Forest, showed the best generalization, indicating that they work well on complex datasets and can help devise targeted intervention strategies.
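For reference, the four evaluation metrics have standard definitions in terms of confusion-matrix counts for a binary outcome (screened vs. not screened). The sketch below uses hypothetical counts, not the study's data, and the function name `classification_metrics` is an assumption for illustration; the abstract does not state whether the reported figures were averaged across classes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Ratios with a zero denominator are returned as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

With these illustrative counts, accuracy is 185/200, precision is 90/100, recall is 90/95, and F1 is 180/195.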
This study demonstrates that ensemble machine learning models, such as the Extra Trees Classifier and Random Forest, are promising for predicting cervical cancer screening uptake among African women, with accuracies of 94.13% and 93.87%, respectively. Key predictors include healthcare access, sociocultural factors, media exposure, urban residence, and contraceptive use. The findings emphasize the need to reduce barriers to care and to use family planning visits and mass media to promote screening. These results should be validated in different populations to support clinical integration via decision support systems.