Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records.

Author information

Ma Leyao, Yang Lin, Wang Yaxin, Hao Jie, Li Yini, Ma Liangkun, Wang Ziyang, Li Ye, Zhang Suhan, Hu Mingyue, Li Jiao, Sun Yin

Affiliations

Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.

Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.

Publication information

Digit Health. 2025 Jul 29;11:20552076251352436. doi: 10.1177/20552076251352436. eCollection 2025 Jan-Dec.

Abstract

OBJECTIVE

Gestational diabetes mellitus (GDM) is one of the most common pregnancy complications. Electronic health records (EHRs) hold promise for GDM risk prediction, but missing data pose a challenge to developing reliable and generalizable risk prediction models. This study aims to address the problem of missing EHR data in GDM prediction before 12 weeks of gestation.

METHODS

A total of 5066 women with singleton pregnancies, aged 18 to 50, were included in this retrospective study. This study evaluated 6 imputation methods, combined with 4 classification machine learning models. The evaluation encompassed downstream predictive performance, robustness to variable missingness, ability to restore original data distribution, and influence on feature selection based on 10-fold cross-validation.
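One of the evaluation criteria above, the ability to restore the original data distribution at different levels of missingness, can be illustrated with a masking experiment: hide known values at a chosen rate, impute them, and measure the error on the hidden entries. The sketch below is an illustration under stated assumptions, not the study's pipeline. It uses synthetic data, simple mean imputation as a stand-in for the six imputation methods (which the abstract does not name in full), and RMSE as a hypothetical restoration metric. All function names are invented for this example.

```python
import random
from statistics import mean


def mask_values(column, rate, rng):
    """Return a copy of `column` with roughly `rate` of the entries set to None."""
    return [None if rng.random() < rate else v for v in column]


def mean_impute(column):
    """Fill None entries with the mean of the observed entries (stand-in imputer)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def restoration_rmse(original, masked, imputed):
    """RMSE computed only over the entries that were artificially masked."""
    errors = [
        (o - i) ** 2
        for o, m, i in zip(original, masked, imputed)
        if m is None
    ]
    return (sum(errors) / len(errors)) ** 0.5 if errors else 0.0


rng = random.Random(0)
# Synthetic stand-in for one continuous EHR variable (assumption, not study data).
original = [rng.gauss(5.0, 0.5) for _ in range(1000)]

results = {}
for rate in (0.1, 0.3, 0.5):
    masked = mask_values(original, rate, rng)
    imputed = mean_impute(masked)
    results[rate] = restoration_rmse(original, masked, imputed)
    print(f"missing rate {rate:.0%}: RMSE on masked entries = {results[rate]:.3f}")
```

Swapping `mean_impute` for another imputer while keeping the same masks would allow a like-for-like comparison of methods. In a cross-validated setting, the imputer would normally be fitted on the training folds only, so that information from the held-out fold does not leak into the filled values.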

RESULTS

Our findings revealed a significant improvement in model performance with imputation. When using the top 30 features, logistic regression (LR) with multivariate imputation by chained equations using classification and regression trees (mice) achieved the highest area under the receiver operating characteristic curve of 0.6899, compared to 0.6336 for the LR model without imputation. Mice also led to the best average performance across prediction models and yielded the most accurate restoration of the original data distribution. LR models trained on data imputed by mice remained the most robust across varying levels of missingness. The classification algorithm primarily accounted for differences in predictive performance. In addition, we identified 18 key features for early GDM prediction in the Chinese population.
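For readers less familiar with the metric, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that definition follows, using made-up labels and scores (not study data) and a hypothetical helper `roc_auc`. The final lines only restate the arithmetic gap between the two AUC values reported above.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = GDM, 0 = no GDM; scores are invented predicted risks.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.7]
auc = roc_auc(labels, scores)
print(f"toy AUC = {auc:.4f}")

# Absolute AUC difference between the two LR results quoted in the abstract.
gain = round(0.6899 - 0.6336, 4)
print(f"absolute AUC gain with imputation = {gain}")
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; rank-based formulas give the same value more efficiently on large cohorts.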

CONCLUSION

This study demonstrates the critical role of imputation in improving the performance and fairness of GDM prediction models. The findings provide practical guidance for integrating imputation into clinical machine learning pipelines.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe64/12317186/aa6f3cf04777/10.1177_20552076251352436-fig1.jpg
