Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models.

Author information

Zhong Yuan, Wang Xiaochen, Wang Jiaqi, Zhang Xiaokun, Wang Yaqing, Huai Mengdi, Xiao Cao, Ma Fenglong

Affiliations

The Pennsylvania State University, University Park, PA, USA.

Dalian University of Technology, Dalian, Liaoning, China.

Publication information

KDD. 2024 Aug;2024:4607-4618. doi: 10.1145/3637528.3671836. Epub 2024 Aug 24.

DOI:10.1145/3637528.3671836

PMID:40255538

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12009115/

Abstract

Synthesizing electronic health records (EHR) data has become a preferred strategy to address data scarcity and to improve data quality and model fairness in healthcare. However, existing approaches for EHR data generation predominantly rely on state-of-the-art generative techniques like generative adversarial networks, variational autoencoders, and language models. These methods typically replicate input visits, resulting in inadequate modeling of temporal dependencies between visits and overlooking the generation of time information, a crucial element in EHR data. Moreover, their ability to learn visit representations is limited by their simple linear mapping functions, which compromises generation quality. To address these limitations, we propose a novel EHR data generation model called EHRPD. It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation. To enhance generation quality and diversity, we introduce a novel time-aware visit embedding module and a pioneering predictive denoising diffusion probabilistic model (P-DDPM). Additionally, we devise a predictive U-Net (PU-Net) to optimize P-DDPM. We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives. The experimental results demonstrate the efficacy and utility of the proposed EHRPD in addressing the aforementioned limitations and advancing EHR data generation.
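
For orientation, the forward (noising) step that standard denoising diffusion probabilistic models build on can be sketched as below. This is the generic DDPM formulation only, not the paper's P-DDPM or PU-Net, whose details the abstract does not give. The linear noise schedule, the 64-dimensional stand-in for a next-visit embedding, and all function and variable names are illustrative assumptions.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; the endpoint values are assumed defaults."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained

# Hypothetical stand-in for the embedding of the visit to be generated.
next_visit_embedding = rng.standard_normal(64)

# Noise the embedding to an intermediate step; a denoising network would be
# trained to recover `noise` (or the clean embedding) from `x_t`.
x_t, noise = forward_noise(next_visit_embedding, 500, alpha_bar, rng)
```

In a predictive setting such as the one the abstract describes, the denoiser would additionally be conditioned on the current visit; how EHRPD does this is not stated here, so the sketch leaves it out.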
