Social Networks and Human-Centered Computing, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
Department of Computer Science, National Chengchi University, Taipei, Taiwan.
J Am Med Inform Assoc. 2019 Mar 1;26(3):228-241. doi: 10.1093/jamia/ocy142.
This study aimed to generate synthetic electronic health records (EHRs) that are more realistic than those generated using the existing medical generative adversarial network (medGAN) method.
We modified medGAN to obtain 2 synthetic data generation models, designated as medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compared the results obtained using the 3 models. We used 2 databases: MIMIC-III and the National Health Insurance Research Database (NHIRD), Taiwan. First, we trained the 3 models and used them to generate synthetic EHRs. We then analyzed and compared the models' performance by using statistical methods (Kolmogorov-Smirnov test, dimension-wise probability for binary data, and dimension-wise average count for count data) and 2 machine learning tasks (association rule mining and prediction).
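The statistical comparisons named above can be illustrated with a minimal sketch. It assumes each record is a fixed-length row of binary flags or counts, one column per medical code. The function names `dimension_wise_mean` and `ks_statistic` and the toy rows are hypothetical, chosen for illustration only; they are not taken from the study's code, and the exact way the authors applied the Kolmogorov-Smirnov test may differ.

```python
from bisect import bisect_right


def dimension_wise_mean(records):
    """Column-wise mean of a list of equal-length rows.

    For binary rows this is the dimension-wise probability (the share of
    records in which a code is present); for count rows it is the
    dimension-wise average count.
    """
    n_rows = len(records)
    n_cols = len(records[0])
    return [sum(row[j] for row in records) / n_rows for j in range(n_cols)]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs.
    Both CDFs are step functions that change only at observed values,
    so checking every observed value is sufficient.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical toy data: 4 patients x 3 binary diagnosis codes.
real = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1]]
synthetic = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]

real_p = dimension_wise_mean(real)        # per-code prevalence, real data
synth_p = dimension_wise_mean(synthetic)  # per-code prevalence, synthetic data

# Per-code absolute gap between real and synthetic prevalence.
gaps = [abs(r - s) for r, s in zip(real_p, synth_p)]

# KS statistic on one count-valued column (hypothetical visit counts).
real_counts = [0, 1, 1, 2, 5]
synth_counts = [0, 1, 2, 2, 4]
ks = ks_statistic(real_counts, synth_counts)
```

A synthetic dataset that tracks the real one closely would show small per-code gaps and a KS statistic near 0. For p-values, a library routine such as `scipy.stats.ks_2samp` would be the usual choice instead of the hand-written statistic.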
We conducted a comprehensive analysis and found that our models were adequately efficient in generating synthetic EHR data. The proposed models outperformed medGAN in all cases analyzed, and among the 3 models, medBGAN performed the best.
Because they generate realistic synthetic EHR data, the proposed models will be useful to the medical industry and related research in providing better services. Moreover, they will help eliminate barriers such as limited access to EHR data and thus accelerate research on medical informatics.
The proposed models can adequately learn the data distribution of real EHRs and efficiently generate realistic synthetic EHRs. The results show the superiority of our models over the existing model.