Evaluating the utility of data integration with synthetic data and statistical matching.

Author information

Ji Eunjeong, Ohn Jung Hun, Jo Hyemin, Park Min-Jeong, Kim Hang J, Shin Cheol Min, Ahn Soyeon

Affiliations

Division of Statistics, Medical Research Collaborating Center, Seoul National University Bundang Hospital, Seongnam-si, Gyeonggi-do, 13620, South Korea.

Department of Internal Medicine, Seoul National University Bundang Hospital, Seongnam-si, Gyeonggi-do, 13620, South Korea.

Publication information

Sci Rep. 2025 Sep 1;15(1):19627. doi: 10.1038/s41598-025-01514-0.

Abstract

Data integration enhances dataset utility but raises privacy concerns due to increased disclosure risks. Synthetic data offers a potential solution, though its role in data integration has not been thoroughly investigated. This study assesses synthetic data integration by evaluating the impact of varying common variables during statistical matching and exploring synthetic-real dataset combinations in donor-recipient settings. We used data from the Korean Genome and Epidemiology Study (KoGES) cohort, with the full dataset as the donor and one-quarter of the subjects as the recipient. Multiple synthetic datasets were generated from both datasets, with varying sets of common variables. Statistical matching was conducted using the nearest-neighbor hot-deck method. Data utility was evaluated using confidence interval overlap measures in the hazard ratio estimates under clinical scenarios to predict diabetes onset. When both donor and recipient data were synthetic, data matched on all available common variables generally outperformed the other matching conditions. However, matching on clinically relevant variables occasionally showed equivalent performance. The synthetic data showed comparable model accuracy to real data, although further investigation is warranted to understand the performance differences. Statistically matched synthetic data offers utility comparable to real data, providing a potential approach for reducing privacy risks while maintaining data utility.
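The two techniques named in the abstract, nearest-neighbor hot-deck matching and confidence interval overlap, can be illustrated with a minimal Python sketch. The abstract does not state the distance function, the software, or the exact overlap formula used in the study, so everything below is an assumption for illustration: the function names `nn_hot_deck` and `ci_overlap` are hypothetical, the distance is Euclidean on standardized common variables, the overlap is the average of the intersection length relative to each interval's length, and the toy numbers are invented rather than taken from KoGES.

```python
import numpy as np


def nn_hot_deck(recipient_common, donor_common, donor_extra):
    """Give each recipient row the extra values of its nearest donor row.

    Assumption: distance is Euclidean on variables standardized with the
    donor mean and standard deviation. The study may have used another metric.
    """
    rec = np.asarray(recipient_common, dtype=float)
    don = np.asarray(donor_common, dtype=float)
    extra = np.asarray(donor_extra)

    # Standardize so that no single common variable dominates the distance.
    mu = don.mean(axis=0)
    sd = don.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    rec_z = (rec - mu) / sd
    don_z = (don - mu) / sd

    # Pairwise squared distances, shape (n_recipient, n_donor).
    d2 = ((rec_z[:, None, :] - don_z[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest donor per recipient
    return extra[nearest], nearest


def ci_overlap(lo_a, hi_a, lo_b, hi_b):
    """Average share of each interval covered by their intersection.

    Assumption: one common definition of interval overlap. The value is 1 for
    identical intervals and turns negative when the intervals are disjoint.
    """
    inter = min(hi_a, hi_b) - max(lo_a, lo_b)
    return 0.5 * (inter / (hi_a - lo_a) + inter / (hi_b - lo_b))


# Toy data (invented): common variables are age and BMI,
# the donor-only variable is a lab value the recipient lacks.
donor_common = [[50, 24.0], [60, 30.0], [40, 22.0]]
donor_extra = [5.1, 6.3, 4.8]
recipient_common = [[41, 22.5], [59, 29.0]]

matched_values, donor_index = nn_hot_deck(
    recipient_common, donor_common, donor_extra
)

# Compare a hazard-ratio interval from real data with one from matched data.
overlap = ci_overlap(1.0, 2.0, 1.5, 2.5)
```

In this sketch, changing the columns passed as common variables corresponds to the "varying sets of common variables" conditions described above, and an overlap closer to 1 indicates that the interval estimated from matched data agrees more closely with the one estimated from real data.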

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/363a/12402339/1ea8d3129f10/41598_2025_1514_Fig1_HTML.jpg
