Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records.

Authors

Ma Leyao, Yang Lin, Wang Yaxin, Hao Jie, Li Yini, Ma Liangkun, Wang Ziyang, Li Ye, Zhang Suhan, Hu Mingyue, Li Jiao, Sun Yin

Affiliations

Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China.

Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.

Publication information

Digit Health. 2025 Jul 29;11:20552076251352436. doi: 10.1177/20552076251352436. eCollection 2025 Jan-Dec.

DOI:10.1177/20552076251352436

PMID:40755962

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12317186/

Abstract

OBJECTIVE

Gestational diabetes mellitus (GDM) is one of the most common pregnancy complications. Electronic health records (EHRs) show promise for GDM risk prediction, but missing data poses a challenge to developing reliable and generalizable risk prediction models. This study aims to address the problem of missing EHR data in GDM prediction before 12 weeks of gestation.

METHODS

A total of 5066 women with singleton pregnancies, aged 18 to 50, were included in this retrospective study. This study evaluated 6 imputation methods, combined with 4 classification machine learning models. The evaluation encompassed downstream predictive performance, robustness to variable missingness, ability to restore original data distribution, and influence on feature selection based on 10-fold cross-validation.
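As a rough illustration of how an imputation step can be evaluated under k-fold cross-validation, the sketch below fits a simple column-mean imputer on the training folds only and applies it to the held-out fold. This is a minimal standard-library sketch, not the study's pipeline: mean imputation stands in for the six methods compared in the paper, no classifier is trained, and the helper names (`k_fold_indices`, `fit_column_means`, `impute`, `impute_within_folds`) are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_column_means(rows):
    """Learn one mean per column from observed (non-None) values only."""
    means = []
    for c in range(len(rows[0])):
        observed = [r[c] for r in rows if r[c] is not None]
        # Assumption: fall back to 0.0 if a column is entirely missing.
        means.append(mean(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Replace each missing value (None) with the learned column mean."""
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in rows]


def impute_within_folds(X, k=10, seed=0):
    """Yield imputed train/test splits; the imputer never sees held-out rows."""
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        train = [X[i] for i in train_idx]
        test = [X[i] for i in test_idx]
        means = fit_column_means(train)  # fitted on training rows only
        yield impute(train, means), impute(test, means), train_idx, test_idx
```

Each imputed training split would then be passed to a classifier and the imputed held-out split used for scoring. Fitting the imputer inside each fold is a design choice made here so that held-out rows do not influence the imputed values; the abstract does not say how the authors handled this.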

RESULTS

Our findings revealed a significant improvement in model performance with imputation. When using the top 30 features, logistic regression (LR) with multivariate imputation by chained equations using classification and regression trees (mice) achieved the highest area under the receiver operating characteristic curve of 0.6899, compared to 0.6336 for the LR model without imputation. Mice also led to the best average performance across prediction models and yielded the most accurate restoration of the original data distribution. LR models trained on data imputed by mice remained the most robust across varying levels of missingness. The classification algorithm primarily accounted for differences in predictive performance. In addition, we identified 18 key features for early GDM prediction in the Chinese population.
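The reported metric, area under the receiver operating characteristic curve (AUROC), can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. On that scale, the gap between 0.6899 and 0.6336 is about 0.056. The sketch below computes AUROC directly from that pairwise definition. It is an illustrative standard-library implementation with a hypothetical function name (`auroc`), not the authors' evaluation code, and its quadratic pairwise loop is intended for small examples only.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```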

CONCLUSION

This study demonstrates the critical role of imputation in improving the performance and fairness of GDM prediction models. The findings provide practical guidance for integrating imputation into clinical machine learning pipelines.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe64/12317186/aa6f3cf04777/10.1177_20552076251352436-fig1.jpg
