

Data Synthesis Reinvented: Preserving Missing Patterns for Enhanced Analysis.

Authors

Wang Xinyue, Asif Hafiz, Gupta Shashank, Vaidya Jaideep

Affiliations

Renmin University, Beijing, China.

Hofstra University, Long Island, NY, USA.

Publication information

IEEE Trans Knowl Data Eng. 2025 Jul;37(7):3962-3975. doi: 10.1109/tkde.2025.3563319. Epub 2025 Apr 22.

Abstract

Synthetic data is widely used as a replacement or enhancement for real data in fields as diverse as healthcare, telecommunications, and finance. Unlike real data, which represents actual people and objects, synthetic data is generated from an estimated distribution that retains key statistical properties of the real data. This makes synthetic data attractive for sharing while addressing privacy, confidentiality, and autonomy concerns. Real data often contains missing values that hold important information about individual, system, or organizational behavior. Standard synthetic data generation methods eliminate missing values as part of their pre-processing steps and thus completely ignore this valuable source of information. Instead, we propose methods to generate synthetic data that preserve both the observable and missing data distributions, thereby retaining the valuable information encoded in the missing patterns of the real data. Our approach handles various missing data scenarios and can easily integrate with existing data generation methods. Extensive empirical evaluations on diverse datasets demonstrate the effectiveness of our approach as well as the value of preserving the missing data distribution in synthetic data.
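
The abstract does not spell out how the proposed methods work, so the following is only a naive illustration of what "preserving missing patterns" can mean, not the authors' method. It assumes numeric tabular data with `None` marking a missing value, fits an independent Gaussian to the observed values of each column (a deliberately simple stand-in for a real generator), and then reuses whole-row missingness patterns resampled from the real data. The function name `synthesize_with_missingness` and the toy dataset are hypothetical.

```python
import random
import statistics


def synthesize_with_missingness(real, n_samples, rng):
    """Naive baseline: per-column Gaussian values plus resampled row masks.

    `real` is a list of equal-length rows; None marks a missing value.
    Assumes every column has at least one observed value.
    """
    n_cols = len(real[0])

    # Fit an independent Gaussian per column using observed values only.
    params = []
    for j in range(n_cols):
        observed = [row[j] for row in real if row[j] is not None]
        params.append((statistics.fmean(observed), statistics.pstdev(observed)))

    synthetic = []
    for _ in range(n_samples):
        # Borrow the missingness pattern of a randomly chosen real row, so
        # that columns which go missing together keep doing so.
        pattern = [value is None for value in rng.choice(real)]
        row = [
            None if missing else rng.gauss(mu, sigma)
            for missing, (mu, sigma) in zip(pattern, params)
        ]
        synthetic.append(row)
    return synthetic


# Toy data: columns 1 and 2 are always missing together, in about 30% of rows.
rng = random.Random(0)
real = []
for _ in range(1000):
    a, b, c = rng.gauss(0, 1), rng.gauss(5, 2), rng.gauss(-1, 0.5)
    if rng.random() < 0.3:
        b = c = None
    real.append([a, b, c])

synthetic = synthesize_with_missingness(real, 500, rng)
```

Because masks are resampled row by row rather than column by column, the joint pattern (here, columns 1 and 2 missing together) carries over to the synthetic rows. This sketch ignores any dependence between the missingness and the values themselves, which a realistic approach would need to model.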
