Kong Deming, Tao Ye, Xiao Haiyan, Xiong Huini, Wei Weizhong, Cai Miao
Wuhan Children's Hospital (Wuhan Maternal and Child Healthcare Hospital), Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China.
Department of Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, China.
Front Pediatr. 2024 Jan 31;12:1330420. doi: 10.3389/fped.2024.1330420. eCollection 2024.
To develop and compare different AutoML frameworks and machine learning models to predict premature birth.
The study used a large electronic medical record database and included 715,962 participants who had a principal diagnosis code of childbirth. Three automated machine learning (AutoML) frameworks were used to construct machine learning models, including tree-based models, ensemble models, and deep neural networks, on the training sample (n = 536,971). The area under the curve (AUC) and training time were used to assess the performance of the prediction models, and feature importance was computed via permutation shuffling.
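The two evaluation ideas named above, AUC and permutation-based feature importance, can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy data, the stand-in `predict` function, the feature names, and the number of shuffle repeats are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in AUC after shuffling one feature column at a time.

    A large drop means the model relied on that feature; a drop of zero
    means the model's ranking did not depend on it.
    """
    rng = random.Random(seed)
    baseline = auc(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - auc(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical toy data: column 0 is an informative binary risk factor,
# column 1 is an unrelated binary variable. The outcome equals column 0.
feature_names = ["informative_factor", "unrelated_factor"]
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model: the risk score is just column 0."""
    return row[0]


baseline, importances = permutation_importance(predict, X, y)
print(f"baseline AUC: {baseline:.3f}")
for name, value in zip(feature_names, importances):
    print(f"{name}: mean AUC drop = {value:.3f}")
```

In this toy setting the unrelated column has an importance of exactly zero because the stand-in model never reads it, while shuffling the informative column pushes the AUC toward 0.5. With a real fitted model, `predict` would be replaced by that model's scoring function and the importance would typically be computed on held-out data.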
The H2O AutoML framework had the highest median AUC of 0.846, followed by AutoGluon (median AUC: 0.840) and Auto-sklearn (median AUC: 0.820). The median training time was lowest for H2O AutoML (0.14 min), followed by AutoGluon (0.16 min) and Auto-sklearn (4.33 min). Among the different types of machine learning models, the Gradient Boosting Machines (GBM) or Extreme Gradient Boosting (XGBoost), stacked ensemble, and random forest models had better predictive performance, with median AUC scores of 0.846, 0.846, and 0.842, respectively. Important features related to preterm birth included premature rupture of membranes (PROM), incompetent cervix, occupation, and preeclampsia.
Our study highlights the potential of machine learning models to predict the risk of preterm birth using readily available electronic medical record data, which has significant implications for improving prenatal care and outcomes.