Evaluating the utility of synthetic COVID-19 case data.

Authors

El Emam Khaled, Mosquera Lucy, Jonker Elizabeth, Sood Harpreet

Affiliations

School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada.

Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada.

Publication information

JAMIA Open. 2021 Mar 1;4(1):ooab012. doi: 10.1093/jamiaopen/ooab012. eCollection 2021 Jan.

Abstract

BACKGROUND

Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner.

OBJECTIVES

Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data.

METHODS

A gradient boosted classification tree was built to predict death using Ontario's 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy, the relationships between predictors and outcome, and privacy risks were evaluated. The same model was developed on a synthesized dataset and compared with the model built on the original data.

RESULTS

The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941-0.948] and 0.34 (95% CI, 0.313-0.368), respectively. The synthetic data model had an AUROC and AUPRC of 0.94 (95% CI, 0.936-0.944) and 0.313 (95% CI, 0.286-0.342), respectively, with confidence interval overlaps of 45.05% and 52.02% relative to the real data model. For both the real and synthetic models, the most important predictors of death were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low.
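The abstract does not say how confidence interval overlap was calculated. As a rough illustration only, the sketch below uses a definition that is common in the synthetic data literature: the length of the intersection of the two intervals, expressed as a fraction of each interval's width and then averaged. The function name `ci_overlap` is hypothetical, and the inputs are the rounded CI bounds quoted above, so the results only approximate the reported 45.05% and 52.02% and are not a reproduction of the authors' calculation.

```python
def ci_overlap(real_ci, synth_ci):
    """Average share of each interval covered by their intersection.

    Assumed definition (not stated in the abstract): the intersection
    length divided by each interval's width, averaged over both intervals.
    The value is 1.0 for identical intervals and negative for disjoint ones.
    """
    (l_r, u_r), (l_s, u_s) = real_ci, synth_ci
    intersection = min(u_r, u_s) - max(l_r, l_s)
    return 0.5 * (intersection / (u_r - l_r) + intersection / (u_s - l_s))


# Rounded 95% CI bounds as quoted in the abstract (real model, synthetic model).
auroc_overlap = ci_overlap((0.941, 0.948), (0.936, 0.944))
auprc_overlap = ci_overlap((0.313, 0.368), (0.286, 0.342))

print(f"AUROC CI overlap: {auroc_overlap:.1%}")
print(f"AUPRC CI overlap: {auprc_overlap:.1%}")
```

Because the published bounds are rounded to three decimals and the intervals are narrow, small rounding differences shift the overlap noticeably, which is one plausible reason the values from this sketch differ from the reported ones.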

CONCLUSIONS

This synthetic dataset could be used as a proxy for the real dataset.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a8fe/7936723/e2616f90515b/ooab012f1.jpg
