Department of Statistics and Data Science, Jahangirnagar University, Savar, Dhaka, 1342, Bangladesh.
J Health Popul Nutr. 2024 Oct 12;43(1):157. doi: 10.1186/s41043-024-00647-8.
A newborn's birth weight is a crucial determinant of overall health and future well-being. Low birth weight (LBW), which the World Health Organization defines as a birth weight of less than 2,500 g, is a widespread global problem. LBW can have severe consequences for an individual's health, including neonatal mortality and various health problems throughout life. To address this problem, this study used Bangladesh Demographic and Health Survey (BDHS) 2017-2018 data to uncover important aspects of LBW using a variety of machine learning (ML) approaches, and to determine the best feature selection technique and the best predictive ML model.
To select the key features, the Boruta algorithm and the wrapper method were used. Logistic Regression (LR) was used as the traditional method, and several machine learning classifiers were then applied to determine the best model for predicting LBW: Decision Tree (DT), Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Adaptive Boosting (AdaBoost, AB). Model performance was evaluated in terms of specificity, sensitivity, accuracy, F1 score and AUC value.
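The evaluation metrics named above can be made concrete with a short sketch. This is an illustration only, not the study's code: the function names `binary_metrics` and `auc_from_scores` and all counts and scores are hypothetical, and LBW is assumed to be the positive class.

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics from one 2x2 confusion matrix (assumes no denominator is zero)."""
    sensitivity = tp / (tp + fn)          # share of positives correctly flagged
    specificity = tn / (tn + fp)          # share of negatives correctly cleared
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts for an imbalanced sample (100 positives, 500 negatives).
metrics = binary_metrics(tp=30, fn=70, fp=10, tn=490)
# Hypothetical predicted scores for four cases.
auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Note that the F1 score depends on which class is treated as positive, so the same confusion matrix yields a different F1 if the majority class is taken as the positive class instead.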
The Boruta algorithm identified eleven significant features: respondent's age, highest education level, educational attainment, wealth index, age at first birth, weight, height, BMI, age at first sexual intercourse, birth order number, and whether the child is a twin. Using these features, the traditional LR method achieved a specificity, sensitivity, accuracy and F1 score of 0.85, 0.5, 85.15% and 0.915, respectively. For the ML methods DT, SVM, NB, RF, XGBoost, and AB, the reported accuracy values were 85.35%, 85.15%, 84.54%, 81.18%, and 84.41%. Based on specificity, sensitivity, accuracy, F1 score and AUC, RF (specificity = 0.99, sensitivity = 0.58, accuracy = 85.86%, F1 score = 0.9243, AUC = 0.549) outperformed the other methods. The performance of both the classical (LR) and ML models improved when the important features were selected with the wrapper method. With this method, LR identified five significant features and achieved a specificity, sensitivity, accuracy and F1 score of 0.87, 0.33, 87.12% and 0.9309, respectively. The DT and RF models, implemented with the wrapper technique, identified three key features: region, whether the infant is a twin, and cesarean delivery. All three models had the identical F1 score of 0.9318. However, "child is twin" was recognized as a significant feature by the SVM, NB, and AB models, with an F1 score of 0.9315. Finally, the XGBoost model recognized "child is twin" and "age at first sex" as relevant features, with an F1 score of 0.9315. Random Forest again outperformed the other approaches in this setting.
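As a rough illustration of what a wrapper method does, the sketch below runs a greedy forward selection: each candidate feature is scored by retraining a classifier with it and keeping it only if the cross-validated score improves. Everything here is an assumption made for illustration: the abstract does not state which wrapper variant was used, a simple nearest-centroid classifier with leave-one-out accuracy stands in for the study's models, and the toy rows and the feature names `twin` and `noise` are hypothetical.

```python
from statistics import mean


def nearest_centroid_predict(train_rows, train_labels, row, features):
    """Predict the class whose feature centroid is closest to `row`."""
    centroids = {}
    for cls in set(train_labels):
        members = [r for r, y in zip(train_rows, train_labels) if y == cls]
        centroids[cls] = [mean(r[f] for r in members) for f in features]

    def squared_distance(centroid):
        return sum((row[f] - c) ** 2 for f, c in zip(features, centroid))

    return min(sorted(centroids), key=lambda cls: squared_distance(centroids[cls]))


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of the stand-in classifier on the given features."""
    hits = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predicted = nearest_centroid_predict(train_rows, train_labels, rows[i], features)
        hits += predicted == labels[i]
    return hits / len(rows)


def forward_select(rows, labels, candidates):
    """Greedy wrapper: add the best-scoring feature until the score stops improving."""
    selected, best_score = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical toy data: "twin" separates the classes, "noise" does not.
rows = [
    {"twin": 1, "noise": 0.2}, {"twin": 1, "noise": 0.8}, {"twin": 1, "noise": 0.5},
    {"twin": 0, "noise": 0.3}, {"twin": 0, "noise": 0.7}, {"twin": 0, "noise": 0.5},
]
labels = [1, 1, 1, 0, 0, 0]
selected, score = forward_select(rows, labels, ["twin", "noise"])
```

Because the selection is driven by the model's own score, different classifiers can settle on different feature subsets, which is consistent with the models above reporting different selected features.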
The study found the wrapper method to be the better feature selection technique. The ML methods outperformed the traditional method, with Random Forest (RF) being the most effective model for predicting low birth weight. The study suggests that policymakers in Bangladesh can reduce the incidence of low birth weight by addressing the identified risk factors.