Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies.

Affiliations

Department of Computer Science, University of South Carolina, Columbia, SC, United States.

Department of Integrated Information Technology, University of South Carolina, Columbia, SC, United States.

Publication information

J Med Internet Res. 2023 May 31;25:e44081. doi: 10.2196/44081.

DOI:10.2196/44081

PMID:37256674

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10267797/

Abstract

BACKGROUND

Low birthweight (LBW) is a leading cause of neonatal mortality in the United States and a major causative factor of adverse health effects in newborns. Identifying high-risk patients early in prenatal care is crucial to preventing adverse outcomes. Previous studies have proposed various machine learning (ML) models for the LBW prediction task, but they were limited by small and imbalanced data sets. Some authors attempted to address this through different data rebalancing methods. However, most of the performances they reported did not reflect how the models would actually perform in real-life scenarios. To date, few studies have successfully benchmarked the performance of ML models in maternal health; thus, it is critical to establish benchmarks that advance ML use and, in turn, improve birth outcomes.

OBJECTIVE

This study aimed to establish several key benchmarking ML models to predict LBW and systematically apply different rebalancing optimization methods to a large-scale and extremely imbalanced all-payer hospital record data set that connects mother and baby data at a state level in the United States. We also performed feature importance analysis to identify the most contributing features in the LBW classification task, which can aid in targeted intervention.

METHODS

Our large data set consisted of 266,687 birth records across 6 years, and 8.63% (n=23,019) of records were labeled as LBW. To set up benchmarking ML models to predict LBW, we applied 7 classic ML models (ie, logistic regression, naive Bayes, random forest, extreme gradient boosting, adaptive boosting, multilayer perceptron, and sequential artificial neural network) while using 4 different data rebalancing methods: random undersampling, random oversampling, synthetic minority oversampling technique, and weight rebalancing. Alongside standard ML evaluation metrics, we used recall as the primary measure of model performance on ethical grounds, because a false negative in health care could be fatal. Recall is the proportion of all actual LBW cases that the model correctly predicts as LBW. We further analyzed feature importance to explore the degree to which each feature contributed to ML model prediction among our best-performing models.
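The four rebalancing strategies named above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy interface (lists of feature rows and 0/1 labels), and the simplified SMOTE-style interpolation are all assumptions made for illustration, and the class-weight formula is the common "balanced" heuristic rather than anything the abstract specifies. A common safeguard, also an assumption here, is to rebalance only the training split so that the test split keeps the real-world class ratio.

```python
import random
from collections import Counter


def _split_indices(y, minority):
    """Return (minority indices, majority indices) for binary labels."""
    mino = [i for i, label in enumerate(y) if label == minority]
    majo = [i for i, label in enumerate(y) if label != minority]
    return mino, majo


def random_undersample(X, y, minority=1, seed=0):
    """Randomly drop majority rows until both classes have equal counts."""
    rng = random.Random(seed)
    mino, majo = _split_indices(y, minority)
    kept = mino + rng.sample(majo, len(mino))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


def random_oversample(X, y, minority=1, seed=0):
    """Randomly duplicate minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    mino, majo = _split_indices(y, minority)
    extra = rng.choices(mino, k=len(majo) - len(mino))
    idx = list(range(len(y))) + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def smote_like(X, y, minority=1, k=3, seed=0):
    """Simplified SMOTE-style step: create synthetic minority rows by
    interpolating between a minority row and one of its k nearest minority
    neighbours. Assumes numeric features and at least 2 minority rows."""
    rng = random.Random(seed)
    pts = [X[i] for i, label in enumerate(y) if label == minority]
    n_needed = len(y) - 2 * len(pts)  # majority count minus minority count
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(pts)
        neighbours = sorted(
            (p for p in pts if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return X + synthetic, y + [minority] * n_needed


def balanced_class_weights(y):
    """Weight rebalancing: weight each class by n / (n_classes * class_count),
    so the rarer class contributes as much total weight as the common one."""
    counts = Counter(y)
    n = len(y)
    return {label: n / (len(counts) * c) for label, c in counts.items()}
```

Unlike the three resampling functions, `balanced_class_weights` leaves the data untouched and only changes how much each misclassified row costs during training, which is the general idea behind weight rebalancing.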

RESULTS

We found that extreme gradient boosting achieved the highest recall score (0.70) using the weight rebalancing method. Our results showed that various data rebalancing methods improved the prediction performance of the LBW group substantially. From the feature importance analysis, maternal race, age, payment source, sum of predelivery emergency department and inpatient hospitalizations, predelivery disease profile, and different social vulnerability index components were important risk factors associated with LBW.
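To see why recall rather than accuracy is the informative metric at this class ratio, the following sketch uses only the counts reported in the abstract (266,687 records, 23,019 labeled LBW). The `recall` and `accuracy` helpers and the "always predict non-LBW" baseline are illustrative assumptions, not part of the study.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted positive: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Counts taken from the abstract.
N_TOTAL = 266_687
N_LBW = 23_019

y_true = [1] * N_LBW + [0] * (N_TOTAL - N_LBW)

# Hypothetical baseline that never flags a birth as LBW.
always_negative = [0] * N_TOTAL

print("accuracy:", accuracy(y_true, always_negative))
print("recall:  ", recall(y_true, always_negative))
```

The baseline is right for roughly 91.4% of records simply because non-LBW births dominate, yet its recall is 0 because it misses every LBW case. A recall of 0.70 is therefore a far more meaningful result than a high accuracy figure would be on this data set.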

CONCLUSIONS

Our findings establish useful ML benchmarks to improve birth outcomes in the maternal health domain. They show how the minority class (ie, LBW) can be identified in an extremely imbalanced data set, which may guide the development of personalized LBW early prevention, clinical interventions, and statewide maternal and infant health policy changes.
