Predicting the risk of threatened abortion using machine learning methods: a comparative study.

Author information

Zhu Zhenning, Wei Na, Guo Junjie, Yue Changlei, Chen Chao, Zhang Zicheng, Wu Shiyu, Su Jie, Song Biao

Affiliations

The Second Affiliated Hospital of Shaanxi University of Chinese Medicine, Gynecology Department, Xianyang, 712000, China.

Beijing Goldwind Yi Tong Technology Co., LTD, Beijing, 100000, China.

Publication information

BMC Pregnancy Childbirth. 2025 Aug 30;25(1):901. doi: 10.1186/s12884-025-08030-z.

DOI:10.1186/s12884-025-08030-z

PMID:40885888

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12398114/

Abstract

BACKGROUND AND OBJECTIVE

Threatened abortion, a common pregnancy complication that often leads to abortion, is hard to predict because its symptoms are non-specific and it is difficult to differentiate from other causes of early pregnancy bleeding. Current diagnostic methods, such as serial ultrasounds and clinical monitoring, are time-consuming and do not give timely results. To fill the gap in using advanced analytics for early detection and risk stratification, this study develops a machine learning (ML) model based on routine blood data to better predict threatened abortion, providing a reference for early detection and intervention.

METHODS

We collected medical records from January 2022 to March 2024 and analyzed data from 1764 patients with threatened abortion and 1489 healthy controls. Blood test data were gathered for all participants. Z-score normalization was applied to standardize the routine blood indicators, which reduced the influence of outliers and noise. During hyperparameter optimization, 'class_weight="balanced"' was set to handle class imbalance. The screened data were partitioned at an 8:1:1 ratio into a training set of 2928 cases (including the validation set) and a test set of 325 cases. Data transformation was performed in Python. Eight ML algorithms were used to construct threatened abortion prediction models: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting (GBM), Extreme Gradient Boosting (XGB), Deep Neural Network (DNN), Decision Tree (DT) and Naive Bayes (NB). The predictive performance of the models was evaluated by calculating area under the curve (AUC) values. We used the SHapley Additive exPlanation (SHAP) method to explain the models.
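The preprocessing arithmetic described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`z_score`, `balanced_class_weights`, `split_sizes`) are hypothetical, the weight formula is the one scikit-learn documents for `class_weight="balanced"` (n_samples / (n_classes * class_count)), and the integer rounding of the 8:1:1 split, including the separate training and validation counts, is an assumption chosen because it reproduces the reported 2928 and 325 totals.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize one blood indicator to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def balanced_class_weights(class_counts):
    """Per-class weight: n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


def split_sizes(n_total, ratio=(8, 1, 1)):
    """Return (train, validation, test) sizes; the remainder goes to train."""
    parts = sum(ratio)
    n_val = n_total * ratio[1] // parts
    n_test = n_total * ratio[2] // parts
    return n_total - n_val - n_test, n_val, n_test


# Group sizes reported in the abstract.
counts = {"threatened_abortion": 1764, "control": 1489}
weights = balanced_class_weights(counts)
n_train, n_val, n_test = split_sizes(sum(counts.values()))

print(weights)                   # the smaller control group gets the larger weight
print(n_train + n_val, n_test)   # training (incl. validation) and test sizes
```

With the reported group sizes, the smaller control group receives a weight slightly above 1 and the threatened abortion group a weight slightly below 1, which is how balanced weighting offsets the imbalance between the two groups.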

RESULTS

Among the eight models, the DNN model showed the highest predictive performance, with the highest AUC (96.76%) and the best accuracy (91.88%), specificity (91.62%), sensitivity (92.11%), and F1 score (92.48%). SHAP analysis identified Red Cell Distribution Width - Standard Deviation (RDW-SD), Platelet Distribution Width (PDW), Mean Platelet Volume (MPV), Red Cell Distribution Width - Coefficient of Variation (RDW-CV), Absolute Basophil Count (BAS#), Platelet Count (PLT), Mean Corpuscular Hemoglobin Concentration (MCHC) and Lymphocyte Percentage (LYM) as the most influential features in predicting threatened abortion. PDW, RDW-CV, BAS#, PLT, MCHC and LYM contributed positively to the prediction, whereas RDW-SD and MPV contributed negatively.
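For readers less familiar with the reported metrics, the sketch below shows how accuracy, specificity, sensitivity, F1 score and AUC are conventionally computed for a binary classifier. It assumes that the AUC refers to the receiver operating characteristic curve, which the abstract does not state explicitly. The function names and the toy inputs are hypothetical and are not data from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "f1": f1,
    }


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example with made-up counts and scores (not study data).
toy_metrics = classification_metrics(tp=9, fp=2, tn=8, fn=1)
toy_auc = roc_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.2, 0.1])

print(toy_metrics)
print(toy_auc)
```

The pairwise AUC shown here is quadratic in the number of cases, which is acceptable for a test set of a few hundred records; library implementations typically use a sort-based method instead.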

CONCLUSIONS

Our research on constructing a prediction model for threatened abortion through routine blood tests has revealed the great potential of ML algorithms in detecting threatened abortion. This algorithm is expected to analyse routine blood data to identify at-risk pregnancies at an early stage, significantly improving the early detection of this common pregnancy complication. It will assist healthcare providers in intervening earlier and reducing the incidence of abortion. However, before the model can be translated into routine clinical applications, more extensive validation studies are still needed.
