Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA.
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
J Am Med Inform Assoc. 2020 Jan 1;27(1):99-108. doi: 10.1093/jamia/ocz161.
Objective: Electronic medical records (EMRs) can support medical research and discovery, but privacy risks limit wide-scale sharing of such data. Various approaches have been developed to mitigate risk, including record simulation via generative adversarial networks (GANs). While GANs show promise in certain application domains, they lack a principled approach for EMR data, which leads to subpar simulation. In this article, we improve EMR simulation through a novel pipeline that (1) enhances the learning model, (2) incorporates evaluation criteria for data utility that inform learning, and (3) refines the training process.
Materials and Methods: We propose a new electronic health record generator using a GAN with a Wasserstein divergence and layer normalization techniques. We designed 2 utility measures to characterize similarity in the structural properties of real and simulated EMRs in the original and latent spaces, respectively. We applied a filtering strategy to enhance GAN training for low-prevalence clinical concepts. We evaluated the new and existing GANs with utility and privacy measures (membership and disclosure attacks) using billing codes from over 1 million EMRs at Vanderbilt University Medical Center.
Results: The proposed model outperformed state-of-the-art approaches, with significant improvement in retaining the nature of real records, including prediction performance and structural properties, without sacrificing privacy. Additionally, the filtering strategy achieved higher utility when the EMR training dataset was small.
Conclusions: These findings illustrate that EMR simulation with GANs can be substantially improved through more appropriate training, modeling, and evaluation criteria.
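To make the membership attack named among the privacy measures more concrete, the following is a minimal sketch of one common way such an attack can be framed. It is an illustration under stated assumptions, not the evaluation procedure used in this study: records are assumed to be binary vectors over billing codes, the attacker is assumed to claim membership whenever a target record lies within a Hamming-distance threshold of some simulated record, and the names `hamming`, `claims_membership`, and `attack_precision_recall` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: each record is a binary vector, 1 if a billing code is present.
Record = Sequence[int]


def hamming(a: Record, b: Record) -> int:
    """Count the positions where two equal-length binary records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same length")
    return sum(x != y for x, y in zip(a, b))


def claims_membership(target: Record, synthetic: List[Record], threshold: int) -> bool:
    """Attacker's guess (assumed rule): the target was in the training data
    if at least one simulated record is within `threshold` of it."""
    return any(hamming(target, s) <= threshold for s in synthetic)


def attack_precision_recall(
    members: List[Record],
    non_members: List[Record],
    synthetic: List[Record],
    threshold: int,
) -> Tuple[float, float]:
    """Score the attack on records whose true membership status is known."""
    tp = sum(claims_membership(r, synthetic, threshold) for r in members)
    fp = sum(claims_membership(r, synthetic, threshold) for r in non_members)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(members) if members else 0.0
    return precision, recall


# Toy data, invented purely for illustration.
synthetic = [[1, 0, 1, 0], [0, 1, 1, 0]]
members = [[1, 0, 1, 1], [0, 0, 0, 0]]
non_members = [[1, 1, 1, 1]]
precision, recall = attack_precision_recall(members, non_members, synthetic, threshold=1)
```

Under this framing, a generator leaks less membership information when the attack's precision and recall stay close to what guessing would achieve across a range of thresholds.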