Data Synthesis Reinvented: Preserving Missing Patterns for Enhanced Analysis.

Authors

Wang Xinyue, Asif Hafiz, Gupta Shashank, Vaidya Jaideep

Affiliations

Renmin University, Beijing, China.

Hofstra University, Long Island, NY, USA.

Publication information

IEEE Trans Knowl Data Eng. 2025 Jul;37(7):3962-3975. doi: 10.1109/tkde.2025.3563319. Epub 2025 Apr 22.

DOI:10.1109/tkde.2025.3563319

PMID:40727435

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12290931/

Abstract

Synthetic data is widely used as a replacement for, or enhancement of, real data in fields as diverse as healthcare, telecommunications, and finance. Unlike real data, which represents actual people and objects, synthetic data is generated from an estimated distribution that retains key statistical properties of the real data. This makes synthetic data attractive for sharing while addressing privacy, confidentiality, and autonomy concerns. Real data often contains missing values that hold important information about individual, system, or organizational behavior. Standard synthetic data generation methods eliminate missing values as part of their pre-processing steps and thus completely ignore this valuable source of information. Instead, we propose methods to generate synthetic data that preserve both the observable and missing data distributions, thereby retaining the valuable information encoded in the missing patterns of the real data. Our approach handles various missing data scenarios and can easily integrate with existing data generation methods. Extensive empirical evaluations on diverse datasets demonstrate the effectiveness of our approach as well as the value of preserving the missing data distribution in synthetic data.
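To make the idea of "preserving missing patterns" concrete, the following is a minimal Python sketch. It is not the authors' method, which the abstract does not specify. It assumes a toy table in which `None` marks a missing field, uses per-column resampling as a stand-in for a real synthetic data generator, and re-applies missingness masks drawn from the real data's empirical pattern distribution. All function names and the sample rows are hypothetical.

```python
import random
from collections import Counter


def missing_pattern(row):
    """Return a tuple of booleans marking which fields are missing (None)."""
    return tuple(value is None for value in row)


def pattern_distribution(rows):
    """Empirical relative frequency of each missingness pattern."""
    counts = Counter(missing_pattern(r) for r in rows)
    total = len(rows)
    return {pattern: count / total for pattern, count in counts.items()}


def synthesize_with_missingness(real_rows, n, rng):
    """Toy generator (an assumption, not the paper's algorithm).

    1. Build complete rows by resampling observed values column by column.
    2. Draw a missingness mask from the real rows and blank out those fields.
    """
    n_cols = len(real_rows[0])
    # Observed (non-missing) values available in each column.
    observed = [[r[j] for r in real_rows if r[j] is not None] for j in range(n_cols)]
    # One mask per real row, so sampling uniformly follows the empirical distribution.
    masks = [missing_pattern(r) for r in real_rows]
    synthetic = []
    for _ in range(n):
        complete = [rng.choice(observed[j]) for j in range(n_cols)]
        mask = rng.choice(masks)
        synthetic.append(tuple(None if m else v for v, m in zip(complete, mask)))
    return synthetic


# Hypothetical records: (age, blood pressure, smoker flag).
real = [
    (34, 120.0, "yes"),
    (51, None, "no"),
    (29, 118.5, None),
    (62, None, None),
    (45, 131.0, "yes"),
    (38, None, "no"),
]

rng = random.Random(0)
synthetic = synthesize_with_missingness(real, 1000, rng)
real_dist = pattern_distribution(real)
synth_dist = pattern_distribution(synthetic)

for pattern, freq in sorted(real_dist.items()):
    print(pattern, round(freq, 3), round(synth_dist.get(pattern, 0.0), 3))
```

A generator that drops incomplete rows during pre-processing would emit only the all-observed pattern, whereas this sketch reproduces the relative frequency of each pattern. It draws masks independently of the values, so it does not capture cases where missingness depends on the data itself, which is one of the scenarios a full method would need to handle.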
