Gonzales Aldren, Guruswamy Guruprabha, Smith Scott R
Office of the Assistant Secretary for Planning and Evaluation, US Department of Health and Human Services, Washington, District of Columbia, United States of America.
Department of Health Administration and Policy, George Mason University, Virginia, United States of America.
PLOS Digit Health. 2023 Jan 6;2(1):e0000082. doi: 10.1371/journal.pdig.0000082. eCollection 2023 Jan.
Data are central to research, public health, and the development of health information technology (IT) systems. Nevertheless, access to most health care data is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Synthetic data offer one of many innovative ways for organizations to share datasets with a broader range of users. However, only limited literature explores their potential and applications in health care. In this review, we examined the existing literature to bridge that gap and highlight the utility of synthetic data in health care. We searched PubMed, Scopus, and Google Scholar to identify peer-reviewed articles, conference papers, reports, and theses/dissertations related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provides evidence that synthetic data are helpful in different aspects of health care and research. While original real data remain the preferred choice, synthetic data hold promise for bridging data access gaps in research and evidence-based policymaking.