Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models.

Authors

Zhong Yuan, Wang Xiaochen, Wang Jiaqi, Zhang Xiaokun, Wang Yaqing, Huai Mengdi, Xiao Cao, Ma Fenglong

Affiliations

The Pennsylvania State University, University Park, PA, USA.

Dalian University of Technology, Dalian, Liaoning, China.

Publication information

KDD. 2024 Aug;2024:4607-4618. doi: 10.1145/3637528.3671836. Epub 2024 Aug 24.

Abstract

Synthesizing electronic health records (EHR) data has become a preferred strategy to address data scarcity and to improve data quality and model fairness in healthcare. However, existing approaches for EHR data generation predominantly rely on state-of-the-art generative techniques like generative adversarial networks, variational autoencoders, and language models. These methods typically replicate input visits, resulting in inadequate modeling of temporal dependencies between visits and overlooking the generation of time information, a crucial element in EHR data. Moreover, their ability to learn visit representations is limited due to simple linear mapping functions, thus compromising generation quality. To address these limitations, we propose a novel EHR data generation model called EHRPD. It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation. To enhance generation quality and diversity, we introduce a novel time-aware visit embedding module and a pioneering predictive denoising diffusion probabilistic model (P-DDPM). Additionally, we devise a predictive U-Net (PU-Net) to optimize P-DDPM. We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives. The experimental results demonstrate the efficacy and utility of the proposed EHRPD in addressing the aforementioned limitations and advancing EHR data generation.
