Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).

Author affiliations

Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, Washington, USA.

Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA.

Publication information

J Am Med Inform Assoc. 2022 Jul 12;29(8):1350-1365. doi: 10.1093/jamia/ocac045.

DOI:10.1093/jamia/ocac045

PMID:35357487

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8992357/

Abstract

OBJECTIVE

This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.

MATERIALS AND METHODS

Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the two datasets was evaluated statistically and qualitatively.
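The epidemic-curve comparison described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the statistics used, so the Pearson correlation and mean absolute difference below are assumptions chosen for illustration, as are the record layout (`date`, `positive`) and all function names.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean


def epidemic_curve(tests, start, end):
    """Daily count of positive tests from start to end, inclusive."""
    counts = Counter(t["date"] for t in tests if t["positive"])
    n_days = (end - start).days + 1
    return [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]


def rolling_mean(values, window=7):
    """Trailing moving average; the first days use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def compare_curves(original, synthetic, start, end, window=7):
    """Summarize similarity of two smoothed epidemic curves (assumed metrics)."""
    orig = rolling_mean(epidemic_curve(original, start, end), window)
    synth = rolling_mean(epidemic_curve(synthetic, start, end), window)
    return {
        "pearson_r": pearson(orig, synth),
        "mean_abs_diff": mean(abs(a - b) for a, b in zip(orig, synth)),
    }
```

The same functions could be applied per zip code by filtering the records first. A high correlation alone does not show agreement in magnitude, which is why the sketch also reports an absolute difference.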

RESULTS

In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n = 171), and monthly indicator counts were similar for all unsuppressed zip codes (n = 5819). With small sample sizes, synthetic data utility was notably lower.
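To make the "reduction of unique zip codes" figure concrete, here is a minimal sketch of label suppression for sparsely tested zip codes. The abstract does not describe how the synthetic data generator decides which labels to suppress, so the simple count threshold (`min_tests`), the `SUPPRESSED` placeholder, and the function names are all hypothetical.

```python
from collections import Counter

SUPPRESSED = "SUPPRESSED"  # hypothetical placeholder label


def suppress_rare_zip_codes(zip_codes, min_tests):
    """Replace zip code labels that occur fewer than min_tests times."""
    counts = Counter(zip_codes)
    return [z if counts[z] >= min_tests else SUPPRESSED for z in zip_codes]


def unique_zip_reduction(original, released):
    """Fraction of unique zip codes lost to suppression."""
    before = len(set(original))
    after = len(set(released) - {SUPPRESSED})
    return (before - after) / before


# Toy input: three zip codes, one of which has a single test.
zips = ["98101"] * 5 + ["98102"] * 4 + ["98103"]
released = suppress_rare_zip_codes(zips, min_tests=3)
reduction = unique_zip_reduction(zips, released)
```

In this toy input, one of three zip codes is suppressed, so a third of the unique labels are lost while only one of ten test records is affected. This mirrors the pattern reported above, where most unique zip codes were suppressed yet each held few tests.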

DISCUSSION

Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between the original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.

CONCLUSION

In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
