Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).

Affiliations

Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, Washington, USA.

Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA.

Publication information

J Am Med Inform Assoc. 2022 Jul 12;29(8):1350-1365. doi: 10.1093/jamia/ocac045.

Abstract

OBJECTIVE

This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.

MATERIALS AND METHODS

Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the two datasets was evaluated statistically and qualitatively.

RESULTS

In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed the labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests), which reduced the number of unique zip codes by 66%. Epidemic curves were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n = 171), and monthly indicator counts were similar for all unsuppressed zip codes (n = 5819). In small sample sizes, synthetic data utility was notably decreased.
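The effect of zip code label suppression on the count of unique zip codes can be sketched as follows. This is a simplified stand-in, not the mechanism used by the synthetic data generator, which the abstract does not describe. The fixed threshold `min_tests`, the `"SUPPRESSED"` placeholder, and both function names are hypothetical, and the zip code list is invented toy data.

```python
from collections import Counter


def suppress_rare_zips(zip_codes, min_tests, placeholder="SUPPRESSED"):
    """Replace the label of any zip code with fewer than min_tests tests.

    min_tests is a hypothetical threshold chosen for illustration.
    """
    counts = Counter(zip_codes)
    return [z if counts[z] >= min_tests else placeholder for z in zip_codes]


def unique_zip_reduction(original, released, placeholder="SUPPRESSED"):
    """Fraction of unique zip codes lost after suppression."""
    before = len(set(original))
    after = len(set(released) - {placeholder})
    return (before - after) / before


# Invented toy data: one zip code label per test.
zips = ["98101"] * 5 + ["63110"] * 4 + ["99999"]
released = suppress_rare_zips(zips, min_tests=3)
reduction = unique_zip_reduction(zips, released)
print(f"Unique zip codes reduced by {reduction:.0%}")
```

Suppression of this kind removes many labels but few records, because the suppressed zip codes each hold only a handful of tests. That is consistent with the abstract's observation that densely tested zip codes contained most of the data.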

DISCUSSION

Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between the original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.

CONCLUSION

In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be21/9277637/5e835afabc18/ocac045f1.jpg
