Evaluating the utility of data integration with synthetic data and statistical matching.

Authors

Ji Eunjeong, Ohn Jung Hun, Jo Hyemin, Park Min-Jeong, Kim Hang J, Shin Cheol Min, Ahn Soyeon

Affiliations

Division of Statistics, Medical Research Collaborating Center, Seoul National University Bundang Hospital, Seongnam-si, Gyeonggi-do, 13620, South Korea.

Department of Internal Medicine, Seoul National University Bundang Hospital, Seongnam-si, Gyeonggi-do, 13620, South Korea.

Publication information

Sci Rep. 2025 Sep 1;15(1):19627. doi: 10.1038/s41598-025-01514-0.

DOI:10.1038/s41598-025-01514-0

PMID:40890136

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12402339/

Abstract

Data integration enhances dataset utility but raises privacy concerns due to increased disclosure risks. Synthetic data offers a potential solution, though its role in data integration has not been thoroughly investigated. This study assesses synthetic data integration by evaluating the impact of varying the common variables used in statistical matching and by exploring synthetic-real dataset combinations in donor-recipient settings. We used data from the Korean Genome and Epidemiology Study (KoGES) cohort, with the full dataset as the donor and one-quarter of the subjects as the recipient. Multiple synthetic datasets were generated from both datasets, with varying sets of common variables. Statistical matching was conducted using the nearest-neighbor hot deck method. Data utility was evaluated using confidence interval overlap measures for the hazard ratio estimates in clinical scenarios predicting diabetes onset. When both donor and recipient data were synthetic, data matched on all available common variables generally outperformed the other matching conditions. However, matching on clinically relevant variables occasionally performed equivalently. The synthetic data showed model accuracy comparable to that of real data, although further investigation is warranted to understand the performance differences. Statistically matched synthetic data offers utility comparable to real data, providing a potential approach for reducing privacy risks while maintaining data utility.
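The two techniques named in the abstract, nearest-neighbor hot deck matching and the confidence interval overlap measure, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The variable names (`age`, `bmi`, `homa_ir`), the unweighted Euclidean distance, and the first-donor tie-breaking are assumptions made for illustration. The overlap function follows the commonly used definition (the average share of each interval covered by their intersection) and returns 0 for disjoint intervals, which may differ from the exact variant used in the study. For hazard ratios, the intervals would typically be compared on the log scale.

```python
import math


def nn_hotdeck_match(recipients, donors, common_vars, donated_vars):
    """Fuse each recipient record with values from its nearest donor.

    Distance is Euclidean over common_vars, which are assumed to be
    numeric and already on comparable scales. Ties go to the first
    donor encountered. Donors may be reused for several recipients.
    """
    if not donors:
        raise ValueError("donor file is empty")
    matched = []
    for rec in recipients:
        rec_point = [rec[v] for v in common_vars]
        nearest = min(
            donors,
            key=lambda d: math.dist(rec_point, [d[v] for v in common_vars]),
        )
        fused = dict(rec)  # copy so the recipient file is not modified
        for v in donated_vars:
            fused[v] = nearest[v]  # donate the variable missing in the recipient
        matched.append(fused)
    return matched


def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval covered by their intersection.

    Returns 1.0 for identical intervals and 0.0 for disjoint ones
    (clamping at zero is an assumption of this sketch).
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    shared = max(0.0, min(hi_r, hi_s) - max(lo_r, lo_s))
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


# Hypothetical toy files: the donor observes homa_ir, the recipient does not.
donors = [
    {"age": 50, "bmi": 24.0, "homa_ir": 1.8},
    {"age": 62, "bmi": 29.5, "homa_ir": 3.1},
]
recipients = [
    {"age": 60, "bmi": 28.0},
    {"age": 48, "bmi": 23.0},
]
fused = nn_hotdeck_match(recipients, donors, ["age", "bmi"], ["homa_ir"])
```

Changing the `common_vars` argument mirrors the study design question of which matching variables to use: the same donor and recipient files can be fused under different sets of common variables, and the resulting estimates compared through `ci_overlap`.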

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/363a/12402339/1ea8d3129f10/41598_2025_1514_Fig1_HTML.jpg
