An evaluation of the replicability of analyses using synthetic health data.

Affiliations

School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.

Replica Analytics, Ottawa, ON, Canada.

Publication information

Sci Rep. 2024 Mar 24;14(1):6978. doi: 10.1038/s41598-024-57207-7.

DOI:10.1038/s41598-024-57207-7
PMID:38521806
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10960851/
Abstract

Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets, each of the same size as the original, whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results depend on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
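
The abstract refers to combining rules for fully synthetic datasets and to confidence interval overlap without spelling either out. The following is a minimal sketch of how these two quantities might be computed for a single regression coefficient. It assumes the commonly cited variance rule T = (1 + 1/m)b - v̄ for fully synthetic data, a normal-approximation interval rather than a t-based one, and the usual averaged-overlap definition of confidence interval overlap; the abstract does not confirm that the authors used exactly these forms. The function names and the numbers in the usage example are hypothetical.

```python
from statistics import NormalDist, mean, variance


def combine_fully_synthetic(estimates, variances):
    """Pool one coefficient across m synthetic datasets.

    Assumed rule (not confirmed by the abstract):
        T = (1 + 1/m) * b - v_bar
    where b is the between-dataset variance of the point estimates and
    v_bar is the mean of the within-dataset variances.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    b = variance(estimates)          # sample variance, divides by m - 1
    v_bar = mean(variances)
    total = (1 + 1 / m) * b - v_bar
    if total <= 0:
        # This estimator can be negative. Falling back to v_bar is a
        # simple guard chosen for this sketch.
        total = v_bar
    return q_bar, total


def wald_interval(estimate, var, level=0.95):
    """Normal-approximation interval (a simplification of a t interval)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * var ** 0.5
    return estimate - half_width, estimate + half_width


def ci_overlap(real_ci, synth_ci):
    """Average share of each interval covered by their intersection.

    Returns 1.0 for identical intervals and a negative value when the
    intervals are disjoint.
    """
    shared = min(real_ci[1], synth_ci[1]) - max(real_ci[0], synth_ci[0])
    real_len = real_ci[1] - real_ci[0]
    synth_len = synth_ci[1] - synth_ci[0]
    return 0.5 * (shared / real_len + shared / synth_len)


if __name__ == "__main__":
    # Made-up log-odds estimates and variances from m = 3 synthetic datasets.
    q_bar, total = combine_fully_synthetic([0.50, 0.60, 0.40], [0.01, 0.01, 0.01])
    synth_ci = wald_interval(q_bar, total)
    real_ci = wald_interval(0.52, 0.01)   # made-up real-data result
    print(q_bar, total, ci_overlap(real_ci, synth_ci))
```

Pooling before computing the interval is the point of the sketch: an interval taken from a single synthetic dataset skips the between-dataset term b, which is the situation the abstract describes as giving erroneous coverage.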
