Theodorou Brandon, Xiao Cao, Glass Lucas, Sun Jimeng
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
GE Healthcare, Seattle, WA, USA.
Patterns (N Y). 2025 May 8;6(6):101261. doi: 10.1016/j.patter.2025.101261. eCollection 2025 Jun 13.
We introduce MediSim, a multi-modal generative model for simulating and augmenting electronic health records (EHRs) across multiple modalities, including structured codes, clinical notes, and medical imaging. MediSim employs a multi-granular, autoregressive architecture to simulate missing modalities and visits, and iterative, reinforcement learning-based training to improve simulation in low-data settings. It also uses encoder-decoder model pairs to handle complex modalities such as notes and images. Experiments on outpatient claims and inpatient ICU datasets demonstrate MediSim's superiority over baselines in predicting missing codes, creating enriched data, and improving downstream predictive modeling. Specifically, MediSim achieved an improvement of over 74% on missing code prediction, enabled up to 65% better downstream predictive performance compared with the original deficient records (those missing either some visits or entire data modalities), and successfully produced realistic note and X-ray samples for use in downstream tasks. MediSim's ability to generate comprehensive, high-dimensional EHR data has the potential to significantly improve AI applications throughout healthcare.