Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data.

Author information

Ma Yinyao, Lv Hanlin, Ma Yanhua, Wang Xiao, Lv Longting, Liang Xuxia, Wang Lei

Affiliations

Department of Obstetrics, People's Hospital of Guangxi Zhuang Autonomous Region, Nanning, 530016, China.

BGI Research, Wuhan, 430074, China.

Publication information

BioData Min. 2025 Mar 24;18(1):25. doi: 10.1186/s13040-025-00440-1.

DOI:10.1186/s13040-025-00440-1

PMID:40128863

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11934807/

Abstract

BACKGROUND

Constructing a predictive model from an imbalanced medical dataset (such as one for preeclampsia) is challenging, particularly when employing ensemble machine learning algorithms.

OBJECTIVE

This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.

METHODS

Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.
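Two of the metrics named above, G-mean and MCC, can be computed directly from a binary confusion matrix. The following is a minimal sketch of their standard definitions, not the authors' evaluation code; the function names and the toy labels are illustrative assumptions, with label 1 standing for the minority (preeclampsia) class.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels; 1 is the minority (positive) class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example (hypothetical data): 2 positives among 10 samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)   # (1, 1, 7, 1)
print(round(g_mean(*counts), 4))            # 0.6614
print(mcc(*counts))                         # 0.375

# A classifier that always predicts the majority class is 80% accurate here,
# yet both metrics drop to zero because no positive case is detected.
always_negative = confusion_counts(y_true, [0] * 10)
print(g_mean(*always_negative), mcc(*always_negative))  # 0.0 0.0
```

The last lines illustrate why such metrics are preferred over plain accuracy on imbalanced data: a model that ignores the minority class entirely scores zero on both.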

RESULTS

Our analysis confirmed the significant impact of systematic sequential optimization of variables on the predictive performance of our models. The most effective configuration utilized the Inverse Weighted Gaussian Mixture Model for resampling, combined with the Gradient Boosting Decision Trees algorithm and an optimized minority-to-majority ratio of 0.09, achieving a Geometric Mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance.
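To make the idea of a target minority-to-majority ratio concrete, the sketch below resamples a dataset until the minority class reaches a chosen fraction of the majority class. It deliberately uses plain random oversampling (duplicating minority rows) as a simple stand-in; it does not implement the Inverse Weighted Gaussian Mixture Model resampler described above. The function name `oversample_to_ratio` and the toy data are hypothetical.

```python
import math
import random

def oversample_to_ratio(X, y, target_ratio, seed=0):
    """Duplicate minority rows (label 1) at random until
    minority / majority >= target_ratio. Majority rows are left untouched.

    Assumption: plain random oversampling, used only to illustrate the ratio;
    it is not the IWGMM resampler from the study.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    if not minority or not majority:
        raise ValueError("both classes must be present")

    # Number of extra minority rows required to reach the target ratio.
    needed = math.ceil(target_ratio * len(majority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(max(0, needed))]

    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (hypothetical): 20 positives and 1000 negatives, a ratio of 0.02.
X = [[i] for i in range(1020)]
y = [1] * 20 + [0] * 1000

X_res, y_res = oversample_to_ratio(X, y, target_ratio=0.09)
print(y_res.count(1) / y_res.count(0))  # at least 0.09 after resampling
```

In practice, resampling of this kind is applied to the training split only, so that validation and test data keep the original class distribution.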

CONCLUSIONS

This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e9d8/11934807/ccc3c21a70bb/13040_2025_440_Fig1_HTML.jpg

Similar articles

Improving Surgical Site Infection Prediction Using Machine Learning: Addressing Challenges of Highly Imbalanced Data.

Diagnostics (Basel). 2025 Feb 19;15(4):501. doi: 10.3390/diagnostics15040501.

Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage.

BMC Med Inform Decis Mak. 2022 Oct 25;22(1):278. doi: 10.1186/s12911-022-02018-x.

The effect of resampling techniques on the performances of machine learning clinical risk prediction models in the setting of severe class imbalance: development and internal validation in a retrospective cohort.

Discov Artif Intell. 2024;4(1):91. doi: 10.1007/s44163-024-00199-0. Epub 2024 Nov 26.

Robust predictive framework for diabetes classification using optimized machine learning on imbalanced datasets.

Front Artif Intell. 2025 Jan 7;7:1499530. doi: 10.3389/frai.2024.1499530. eCollection 2024.

Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis.

Cancers (Basel). 2024 Oct 8;16(19):3417. doi: 10.3390/cancers16193417.

Ensemble machine learning model trained on a new synthesized dataset generalizes well for stress prediction using wearable devices.

J Biomed Inform. 2023 Dec;148:104556. doi: 10.1016/j.jbi.2023.104556. Epub 2023 Dec 2.

Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.

Technol Health Care. 2025 Mar;33(2):1000-1013. doi: 10.1177/09287329241295874. Epub 2024 Nov 25.

A Novel Explainable Attention-Based Meta-Learning Framework for Imbalanced Brain Stroke Prediction.

Sensors (Basel). 2025 Mar 11;25(6):1739. doi: 10.3390/s25061739.

Combining Resampling Strategies and Ensemble Machine Learning Methods to Enhance Prediction of Neonates with a Low Apgar Score After Induction of Labor in Northern Tanzania.

Risk Manag Healthc Policy. 2021 Sep 7;14:3711-3720. doi: 10.2147/RMHP.S331077. eCollection 2021.

