Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, South Korea.
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA.
J Am Med Inform Assoc. 2020 Jul 1;27(9):1411-1419. doi: 10.1093/jamia/ocaa119.
Recent studies on electronic health records (EHRs) have begun to train deep generative models that synthesize large volumes of realistic records, in order to address the significant privacy issues surrounding EHRs. However, most of them focus only on structured records of patients' independent visits rather than on chronological clinical records. In this article, we aim to learn and synthesize realistic sequences of EHRs using a generative autoencoder.
We propose a dual adversarial autoencoder (DAAE), which learns set-valued sequences of medical entities by combining a recurrent autoencoder with 2 generative adversarial networks (GANs). DAAE improves the mode coverage and quality of generated sequences by adversarially learning both the continuous latent distribution and the discrete data distribution. Using the MIMIC-III (Medical Information Mart for Intensive Care-III) and UT Physicians clinical databases, we evaluated the performance of DAAE in terms of predictive modeling, plausibility, and privacy preservation.
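To make the term "set-valued sequence" concrete, the following minimal sketch shows one common way to represent such data: each visit is an unordered set of medical codes, and a patient is an ordered list of visits, encoded as one multi-hot vector per visit. The code strings, the helper names `build_vocab`, `encode`, and `decode`, and the multi-hot encoding itself are assumptions made for illustration; the abstract does not state how DAAE encodes its inputs.

```python
from typing import Dict, List, Set

# Hypothetical toy patient: an ordered list of visits, where each visit is
# an unordered set of medical codes (arbitrary illustrative strings).
patient: List[Set[str]] = [
    {"I10", "E11.9"},
    {"I10"},
    {"N18.3", "E11.9", "I10"},
]


def build_vocab(patients: List[List[Set[str]]]) -> Dict[str, int]:
    """Map every distinct code to a column index (sorted for determinism)."""
    codes = sorted({code for p in patients for visit in p for code in visit})
    return {code: i for i, code in enumerate(codes)}


def encode(visits: List[Set[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Turn each visit into a multi-hot row; row order preserves visit order."""
    rows = []
    for visit in visits:
        row = [0] * len(vocab)
        for code in visit:
            row[vocab[code]] = 1
        rows.append(row)
    return rows


def decode(rows: List[List[int]], vocab: Dict[str, int]) -> List[Set[str]]:
    """Invert `encode`: recover the set of codes present in each visit."""
    index_to_code = {i: code for code, i in vocab.items()}
    return [{index_to_code[i] for i, bit in enumerate(row) if bit} for row in rows]


vocab = build_vocab([patient])
encoded = encode(patient, vocab)
print(encoded)
```

In this representation the order of rows carries the chronology, while the order of codes within a visit carries no meaning, which is the property a sequence model over set-valued data has to respect.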
Our generated EHR sequences performed comparably to real data on a predictive modeling task and achieved the best score among all baseline models in a plausibility evaluation conducted by medical experts. In addition, differentially private optimization of our model enables the generation of synthetic sequences without increasing the privacy leakage of patients' data.
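The abstract does not say which differentially private optimizer was used. As an illustration only, the sketch below shows the core step of DP-SGD, a widely used instance of differentially private optimization: each per-example gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. The function names and the parameters `clip_norm` and `noise_multiplier` are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List


def clip_to_norm(grad: List[float], clip_norm: float) -> List[float]:
    """Scale a gradient down so its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(
    per_example_grads: List[List[float]],
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """One DP-SGD-style aggregation: clip, sum, add Gaussian noise, average."""
    clipped = [clip_to_norm(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is tied to the clipping norm, which bounds
    # how much any single example can change the sum.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy gradients (made-up numbers): the first has norm 5 and gets clipped,
# the second has norm 0.5 and passes through unchanged.
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = private_mean_gradient(toy_grads, 1.0, 0.0, random.Random(0))
with_noise = private_mean_gradient(toy_grads, 1.0, 1.0, random.Random(0))
print(no_noise, with_noise)
```

Clipping limits the influence of any one patient's records on the update, and the noise masks what remains. The actual privacy guarantee depends on the noise level, the sampling rate, and the number of training steps, none of which this sketch accounts for.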
DAAE can effectively synthesize sequential EHRs by addressing the main challenges of this task: the synthetic records should be realistic enough to be indistinguishable from real records, and they should cover all the training patients so that the performance of specific downstream tasks can be reproduced.