Silva Anna Beatriz, Rocha Elisson da Silva, Lorenzato João Fausto, Endo Patricia Takako
Universidade de Pernambuco, Pernambuco, Brazil.
PLoS One. 2025 Apr 2;20(3):e0316574. doi: 10.1371/journal.pone.0316574. eCollection 2025.
Premature birth, defined as birth before 37 weeks of gestation, is a significant global health issue and the leading cause of neonatal deaths. In this work, we evaluate machine learning models for predicting premature birth using Brazilian sociodemographic and obstetric data, focusing on the challenge of data imbalance, a common problem that can lead to biased predictions. We evaluate five data balancing techniques: Undersampling, Oversampling, and three Hybridsampling configurations in which the minority class was increased by factors of 2, 3, and 4. The machine learning models, including Decision Tree, Random Forest, and AdaBoost, are trained and evaluated on a dataset of over 483,000 cases. With the Hybridsampling approach, the Decision Tree model achieved an accuracy of 70%, a recall of 64%, and a precision of 74%. Results show that Hybridsampling techniques significantly improve the models' performance compared to Undersampling and Oversampling, highlighting the importance of proper data balancing in predictive models for preterm birth. Our work is particularly relevant to the Brazilian Unified Health System (SUS). By improving the accuracy of premature birth predictions, our models could assist healthcare providers in identifying at-risk pregnancies earlier, allowing for timely interventions. Integrating such models into SUS could enhance maternal and neonatal care, reduce the incidence of preterm births, and potentially decrease neonatal mortality, especially in underserved regions.
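The abstract does not spell out how the Hybridsampling configurations are built. As a minimal sketch of one common reading, the code below assumes that the minority class is first oversampled by random duplication until it reaches `factor` times its original size, and that the majority class is then randomly undersampled to the same size. The function name `hybrid_sample`, its parameters, and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def hybrid_sample(rows, labels, factor, minority_label=1, seed=0):
    """Illustrative hybrid sampling (an assumption, not the authors' code).

    Grows the minority class to `factor` times its size by random
    duplication, then randomly undersamples the majority class to match.
    Assumes factor >= 1 and a non-empty minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    # Never ask for more majority rows than actually exist.
    target = min(len(minority) * factor, len(majority))

    # Keep every original minority row, then add random duplicates.
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    # Draw majority rows without replacement.
    kept_majority = rng.sample(majority, target)

    idx = minority + extra + kept_majority
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


if __name__ == "__main__":
    # Toy data: 10 minority (label 1) and 90 majority (label 0) cases.
    rows = list(range(100))
    labels = [1] * 10 + [0] * 90
    for factor in (2, 3, 4):
        _, y = hybrid_sample(rows, labels, factor)
        print(factor, Counter(y))
```

In practice, resampling of this kind would be applied only to the training split, so that the reported accuracy, recall, and precision are measured on data with the original class distribution.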