Prediction of preterm birth using machine learning: a comprehensive analysis based on large-scale preschool children survey data in Shenzhen of China.

Author information

Ding Liwen, Yin Xiaona, Wen Guomin, Sun Dengli, Xian Danxia, Zhao Yafen, Zhang Maolin, Yang Weikang, Chen Weiqing

Affiliations

Department of Epidemiology and Health Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, China.

Women's and Children's Hospital of Longhua District of Shenzhen, Shenzhen, 518109, China.

Publication information

BMC Pregnancy Childbirth. 2024 Dec 4;24(1):810. doi: 10.1186/s12884-024-06980-4.

DOI:10.1186/s12884-024-06980-4

PMID:39633287

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11616287/

Abstract

BACKGROUND

Preterm birth (PTB) is a significant cause of neonatal mortality and long-term health issues. Accurate prediction and timely prevention of PTB are essential for reducing the associated child mortality and morbidity. Traditional predictive methods struggle because risk factors are heterogeneous and interact with one another. This study aims to develop and evaluate six machine learning (ML) models to predict PTB using large-scale survey data on preschool children from Shenzhen, China, and to identify key predictors through Shapley Additive Explanations (SHAP) analysis.

METHODS

Data from 84,050 mother-child pairs, collected in 2021 and 2022, were processed and divided into training, validation, and test sets. Six ML models were tested: L1-Regularised Logistic Regression, Light Gradient Boosting Machine (LightGBM), Naive Bayes, Random Forests, Support Vector Machine, and Extreme Gradient Boosting (XGBoost). Model performance was evaluated based on discrimination, calibration and clinical utility. SHAP analysis was used to interpret the importance and impact of individual features on PTB prediction.
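The three evaluation dimensions named above can be made concrete with a small sketch. The abstract does not say which statistics were used for calibration or clinical utility, so the Brier score and decision-curve net benefit below are assumptions chosen as common options. The labels and predicted risks are invented for illustration and are not study data.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Discrimination: chance that a random positive case outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Calibration-related: mean squared gap between predicted risk and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Clinical utility: net benefit of treating everyone with risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the chosen risk threshold.
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical toy data: 1 = preterm birth, p = predicted risk from some model.
y = [0, 0, 1, 0, 1, 0, 0, 1]
p = [0.05, 0.10, 0.40, 0.20, 0.15, 0.02, 0.30, 0.60]

print(f"AUC: {auc(y, p):.3f}")
print(f"Brier score: {brier_score(y, p):.3f}")
print(f"Net benefit at 25% risk threshold: {net_benefit(y, p, 0.25):.3f}")
```

In practice these statistics would be computed on the held-out validation and test sets rather than on the training data, which is the reason for the three-way split.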

RESULTS

The XGBoost model demonstrated the best overall performance, with an area under the receiver operating characteristic curve (AUC) of 0.752 in the validation set and 0.757 in the test set, along with favorable calibration and clinical utility. The key predictors identified were multiple pregnancy, threatened abortion, and maternal age at conception. SHAP analysis showed that multiple pregnancy and threatened abortion increased the predicted risk of PTB, whereas micronutrient supplementation decreased it.
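To illustrate what a positive or negative SHAP contribution means, the sketch below computes exact Shapley values for a tiny hand-written risk function. The function `toy_risk`, its coefficients, and the all-zero baseline are hypothetical and are not the study's fitted XGBoost model. The SHAP library itself averages over a background dataset and uses faster tree-specific algorithms, so this is a simplification of the idea rather than the authors' procedure.

```python
from itertools import permutations
from typing import Callable, Dict

Features = Dict[str, float]


def shapley_values(model: Callable[[Features], float],
                   x: Features, baseline: Features) -> Features:
    """Exact Shapley values of `x` relative to a single baseline input."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)
        previous = model(current)
        # Switch features from baseline to observed value one at a time
        # and credit each feature with the change it causes.
        for name in order:
            current[name] = x[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    # Average the marginal contributions over all feature orderings.
    return {name: total / len(orders) for name, total in totals.items()}


def toy_risk(f: Features) -> float:
    """Hypothetical risk score with one interaction term (illustration only)."""
    return (0.05
            + 0.20 * f["multiple_pregnancy"]
            + 0.10 * f["threatened_abortion"]
            - 0.03 * f["micronutrients"]
            + 0.04 * f["multiple_pregnancy"] * f["threatened_abortion"])


x = {"multiple_pregnancy": 1, "threatened_abortion": 1, "micronutrients": 1}
baseline = {"multiple_pregnancy": 0, "threatened_abortion": 0, "micronutrients": 0}

phi = shapley_values(toy_risk, x, baseline)
for name, value in phi.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: {value:+.3f} ({direction} predicted risk)")
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that gives SHAP its name. The interaction term is split equally between the two features involved.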

CONCLUSION

Our study found that ML models, particularly XGBoost, show promise in predicting PTB and identifying key risk factors. These findings highlight the potential of ML for enhancing clinical interventions, personalizing prenatal care, and informing public health initiatives.
