Ma Yinyao, Lv Hanlin, Ma Yanhua, Wang Xiao, Lv Longting, Liang Xuxia, Wang Lei
Department of Obstetrics, People's Hospital of Guangxi Zhuang Autonomous Region, Nanning, 530016, China.
BGI Research, Wuhan, 430074, China.
BioData Min. 2025 Mar 24;18(1):25. doi: 10.1186/s13040-025-00440-1.
Constructing a predictive model is challenging for imbalanced medical datasets (such as those for preeclampsia), particularly when employing ensemble machine learning algorithms.
This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.
Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.
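To make two of the named metrics concrete, the following is a minimal sketch of how G-mean and MCC can be computed from binary predictions, using their standard textbook definitions. The function names and the toy labels are illustrative assumptions, not the study's code or data.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive/minority class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical toy labels: 2 positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)

# A classifier that always predicts the majority class.
all_negative = confusion_counts(y_true, [0] * len(y_true))
```

On the toy labels, a classifier that always predicts the majority class is correct in 8 of 10 cases yet has a G-mean of 0, which is why metrics of this kind are preferred over accuracy when classes are imbalanced.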
Our analysis confirmed that systematic, sequential optimization of the variables had a significant impact on the predictive performance of our models. The most effective configuration used the Inverse Weighted Gaussian Mixture Model for resampling, combined with the Gradient Boosting Decision Trees algorithm and an optimized minority-to-majority ratio of 0.09, achieving a geometric mean (G-mean) of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics.
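As a rough illustration of what a target minority-to-majority ratio implies, the sketch below computes how many synthetic minority samples an oversampler would need to generate. It assumes the ratio is reached by adding minority samples while leaving the majority class unchanged; the function name and the class counts are hypothetical, and the sketch does not implement the Inverse Weighted Gaussian Mixture Model itself.

```python
def synthetic_samples_needed(n_minority, n_majority, target_ratio):
    """Number of minority samples to add so that minority / majority is about target_ratio.

    Assumes oversampling only (the majority class is left untouched).
    Returns 0 if the data already meets or exceeds the target ratio.
    """
    if n_majority <= 0 or target_ratio <= 0:
        raise ValueError("n_majority and target_ratio must be positive")
    target_minority = round(target_ratio * n_majority)
    return max(0, target_minority - n_minority)

# Hypothetical class counts, not taken from the study.
needed = synthetic_samples_needed(n_minority=300, n_majority=10_000, target_ratio=0.09)
already_met = synthetic_samples_needed(n_minority=1_200, n_majority=10_000, target_ratio=0.09)
```

With these made-up counts, 600 synthetic cases would bring the minority class to 900, giving a ratio of 0.09 against 10,000 majority cases.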
This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.