Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Rochester Institute of Technology, Rochester, New York, USA.
J Am Med Inform Assoc. 2021 Mar 18;28(4):801-811. doi: 10.1093/jamia/ocaa303.
This study seeks to develop a fully automated method of generating synthetic data from a real dataset that medical organizations could use to distribute health data to researchers, reducing the need for access to real data. We hypothesize that Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data.
We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.
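The simulation step described above, drawing synthetic records from a Bayesian network, can be illustrated with ancestral (forward) sampling: each variable is sampled in an order where parents come before children, conditioned on the parent values already drawn. The following is a minimal sketch only, not the authors' implementation. The three variables, the graph, and every probability are hypothetical, and the structure is written by hand here rather than learned from data as in the study.

```python
import random

# Hypothetical network (assumption, not from the study):
#   age_over_60 -> diabetes -> heart_disease, and age_over_60 -> heart_disease
PARENTS = {
    "age_over_60": (),
    "diabetes": ("age_over_60",),
    "heart_disease": ("age_over_60", "diabetes"),
}

# P(variable = 1 | parent values), keyed by the tuple of parent values.
# All numbers are made up for illustration.
CPT = {
    "age_over_60": {(): 0.30},
    "diabetes": {(0,): 0.08, (1,): 0.20},
    "heart_disease": {(0, 0): 0.03, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.35},
}


def topological_order(parents):
    """Return the variables ordered so every parent precedes its children.

    Assumes the graph is acyclic, as a Bayesian network must be.
    """
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        for parent in parents[node]:
            visit(parent)
        seen.add(node)
        order.append(node)

    for node in parents:
        visit(node)
    return order


ORDER = topological_order(PARENTS)


def sample_record(rng):
    """Draw one synthetic record by sampling each variable given its parents."""
    record = {}
    for var in ORDER:
        key = tuple(record[p] for p in PARENTS[var])
        record[var] = 1 if rng.random() < CPT[var][key] else 0
    return record


def sample_records(n, seed=0):
    """Draw n synthetic records with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in sample_records(5, seed=42):
        print(row)
```

In a full pipeline, the graph and the conditional probability tables would be estimated from the real dataset before sampling, and the resulting synthetic records would then be compared with the real ones using the evaluations listed above.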
Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.
Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.
We conclude that Bayesian networks are a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.