IEEE J Biomed Health Inform. 2020 Aug;24(8):2378-2388. doi: 10.1109/JBHI.2020.2980262. Epub 2020 Mar 12.
The medical and machine learning communities are relying on the promise of artificial intelligence (AI) to transform medicine by enabling more accurate decisions and personalized treatment. However, progress is slow. Legal and ethical issues around unconsented patient data and privacy are among the limiting factors in data sharing, creating a significant barrier to the machine learning community's access to routinely collected electronic health records (EHR). We propose a novel framework for generating synthetic data that closely approximates the joint distribution of variables in an original EHR dataset, providing a readily accessible, legally and ethically appropriate solution that supports more open data sharing and enables the development of AI solutions. To address the lack of clarity in defining sufficient anonymization, we created a quantifiable, mathematical definition for "identifiability". We used a conditional generative adversarial network (GAN) framework to generate synthetic data while minimizing patient identifiability, which is defined based on the probability of re-identification given the combination of all data on any individual patient. We compared models fitted to our synthetically generated data with those fitted to the real data across four independent datasets to evaluate similarity in model performance, while assessing the extent to which original observations can be identified from the synthetic data. Our model, ADS-GAN, consistently outperformed state-of-the-art methods, and demonstrated reliability in the joint distributions. We propose that this method could be used to develop datasets that can be made publicly available while considerably lowering the risk of breaching patient confidentiality.
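The abstract describes identifiability as a quantifiable measure of re-identification risk but does not give its formula. As a rough illustration only, the sketch below assumes one plausible distance-based reading: a real record counts as identifiable when some synthetic record lies closer to it than its nearest other real record. The function names, the unweighted Euclidean distance, and the decision rule are all assumptions made for this example, not the authors' definition.

```python
import math
from typing import Sequence

Record = Sequence[float]


def nearest_distance(point: Record, others: Sequence[Record]) -> float:
    """Euclidean distance from `point` to its closest record in `others`."""
    return min(math.dist(point, other) for other in others)


def identifiability_score(real: Sequence[Record], synthetic: Sequence[Record]) -> float:
    """Hypothetical metric: fraction of real records that sit closer to a
    synthetic record than to any other real record.

    Assumption: plain (unweighted) Euclidean distance on numeric features.
    """
    if len(real) < 2 or len(synthetic) == 0:
        raise ValueError("need at least two real records and one synthetic record")

    flagged = 0
    for i, record in enumerate(real):
        # Exclude the record itself when looking for its nearest real neighbour.
        other_real = [r for j, r in enumerate(real) if j != i]
        if nearest_distance(record, synthetic) < nearest_distance(record, other_real):
            flagged += 1
    return flagged / len(real)
```

Under this assumed rule, a synthetic dataset that copies the real records scores 1.0, and one whose records all lie far from every real record scores 0.0. A generator could then be tuned so that the score stays below a chosen threshold.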