Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, Washington, USA.
Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA.
J Am Med Inform Assoc. 2022 Jul 12;29(8):1350-1365. doi: 10.1093/jamia/ocac045.
This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.
Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the two datasets was evaluated statistically and qualitatively.
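As a minimal sketch of the stratified comparison described above, the snippet below counts tests per zip code and month in two datasets and reports how far the counts diverge. The function names, the toy records, and the use of mean absolute difference as the similarity measure are illustrative assumptions; the abstract does not state which statistics the study used.

```python
from collections import Counter


def monthly_counts(records):
    """Count tests per (zip_code, month) stratum."""
    return Counter((zip_code, month) for zip_code, month in records)


def compare_counts(original, synthetic):
    """Return per-stratum absolute differences and their mean.

    Mean absolute difference is an assumed, illustrative measure,
    not the statistic reported by the study.
    """
    orig = monthly_counts(original)
    synth = monthly_counts(synthetic)
    # Use the union of strata so a stratum missing from one dataset counts as 0.
    strata = sorted(set(orig) | set(synth))
    diffs = {s: abs(orig[s] - synth[s]) for s in strata}
    mean_abs_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return diffs, mean_abs_diff


# Hypothetical toy records: (zip code, month of test).
original = [("98101", "2020-03"), ("98101", "2020-03"),
            ("98101", "2020-04"), ("63110", "2020-03")]
synthetic = [("98101", "2020-03"), ("98101", "2020-04"),
             ("98101", "2020-04"), ("63110", "2020-03")]

diffs, mean_abs_diff = compare_counts(original, synthetic)
```

In practice the same stratification could be applied to any indicator (for example, positive tests or hospitalizations) by filtering the records before counting.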
In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction in unique zip codes). Epidemic curves were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n = 171), and monthly indicator counts were similar for all unsuppressed zip codes (n = 5819). With small sample sizes, synthetic data utility was notably decreased.
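To illustrate how label suppression can reduce the number of unique zip codes, here is a minimal sketch that masks the labels of rarely tested zip codes and measures the resulting reduction. The threshold `min_tests`, the `"SUPPRESSED"` placeholder, and the toy data are hypothetical; this is a simple count-based rule, not the mechanism used by the study's synthetic data generator.

```python
from collections import Counter


def suppress_rare_zip_codes(zip_codes, min_tests, placeholder="SUPPRESSED"):
    """Replace labels of zip codes with fewer than min_tests tests."""
    counts = Counter(zip_codes)
    return [z if counts[z] >= min_tests else placeholder for z in zip_codes]


def unique_zip_reduction(before, after, placeholder="SUPPRESSED"):
    """Fraction of unique zip code labels lost to suppression."""
    n_before = len(set(before))
    n_after = len(set(after) - {placeholder})
    return 1 - n_after / n_before


# Hypothetical toy data: one zip code label per test.
zip_codes = ["98101"] * 5 + ["63110"] * 4 + ["99999"] * 1 + ["00001"] * 2

masked = suppress_rare_zip_codes(zip_codes, min_tests=3)
reduction = unique_zip_reduction(zip_codes, masked)
```

Here two of the four zip codes fall below the assumed threshold, so half of the unique labels are removed while only 3 of 12 tests lose their label, which mirrors how suppression can remove many zip codes yet affect little of the data.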
Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between the original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.
In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.