Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data.

Author information

Ma Yinyao, Lv Hanlin, Ma Yanhua, Wang Xiao, Lv Longting, Liang Xuxia, Wang Lei

Affiliations

Department of Obstetrics, People's Hospital of Guangxi Zhuang Autonomous Region, Nanning, 530016, China.

BGI Research, Wuhan, 430074, China.

Publication information

BioData Min. 2025 Mar 24;18(1):25. doi: 10.1186/s13040-025-00440-1.

Abstract

BACKGROUND

Constructing a predictive model is challenging for imbalanced medical datasets (such as those for preeclampsia), particularly when employing ensemble machine learning algorithms.

OBJECTIVE

This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.

METHODS

We established a pipeline for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records of pregnant women treated at the People's Hospital of Guangxi Zhuang Autonomous Region from 2015 to 2020 and performed additional external validation on three public datasets. Using a structured evaluation process, we systematically assessed resampling techniques, minority-to-majority ratios, and ensemble machine learning algorithms. We analyzed 4,608 combinations of model settings against performance metrics such as the geometric mean (G-mean), Matthews correlation coefficient (MCC), average precision (AP), and area under the curve (AUC) to determine the most effective configurations. Ordinary least squares (OLS) regression, analysis of variance (ANOVA), and Kruskal-Wallis tests were used to tune these settings, with the aim of improving model performance and robustness for clinical application.
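To illustrate why metrics such as G-mean and MCC are preferred over plain accuracy on imbalanced data, the following is a minimal sketch using their standard textbook definitions. The function names and the example counts are hypothetical and are not taken from the study.

```python
import math

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical test set: 10 positive cases, 90 negative cases.
# A model that always predicts "negative" looks accurate but finds no cases.
always_negative = dict(tp=0, fp=0, tn=90, fn=10)
# A model that detects 8 of 10 cases at the cost of 9 false alarms.
useful_model = dict(tp=8, fp=9, tn=81, fn=2)

for name, counts in [("always negative", always_negative),
                     ("useful model", useful_model)]:
    print(name,
          round(accuracy(**counts), 3),
          round(g_mean(**counts), 3),
          round(mcc(**counts), 3))
```

In this toy case the always-negative model reaches 90% accuracy while its G-mean and MCC are both zero, which is the failure mode these metrics are meant to expose.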

RESULTS

Our analysis confirmed that systematic, sequential optimization of these variables has a significant impact on predictive performance. The most effective configuration used the Inverse Weighted Gaussian Mixture Model for resampling, the Gradient Boosting Decision Trees algorithm, and an optimized minority-to-majority ratio of 0.09, achieving a G-mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline on all evaluated metrics.
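As a rough illustration of what a minority-to-majority ratio of 0.09 means in practice, the sketch below computes how many minority samples would have to be added to reach a target ratio. It assumes the ratio is reached by oversampling the minority class, which the abstract does not state; the function name and the class counts are hypothetical.

```python
import math
from fractions import Fraction

def minority_samples_to_add(n_minority, n_majority, target_ratio):
    """Samples to add so that minority / majority >= target_ratio.

    target_ratio is given as a string (e.g. "0.09") so it can be parsed
    exactly, avoiding floating-point rounding in the ceiling step.
    """
    ratio = Fraction(target_ratio)
    required_minority = math.ceil(ratio * n_majority)
    return max(0, required_minority - n_minority)

# Hypothetical cohort: 300 cases among 10,000 controls (ratio 0.03).
to_add = minority_samples_to_add(300, 10_000, "0.09")
print(to_add, (300 + to_add) / 10_000)
```

A resampler that instead removes majority samples would reach the same ratio by shrinking the denominator, so this arithmetic covers only the oversampling case.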

CONCLUSIONS

This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e9d8/11934807/ccc3c21a70bb/13040_2025_440_Fig1_HTML.jpg
