Wang Xinyue, Asif Hafiz, Gupta Shashank, Vaidya Jaideep
Renmin University, Beijing, China.
Hofstra University, Long Island, NY, USA.
IEEE Trans Knowl Data Eng. 2025 Jul;37(7):3962-3975. doi: 10.1109/tkde.2025.3563319. Epub 2025 Apr 22.
Synthetic data is widely used as a replacement or enhancement for real data in fields as diverse as healthcare, telecommunications, and finance. Unlike real data, which represents actual people and objects, synthetic data is generated from an estimated distribution that retains key statistical properties of the real data. This makes synthetic data attractive for sharing while addressing privacy, confidentiality, and autonomy concerns. Real data often contains missing values that hold important information about individual, system, or organizational behavior. Standard synthetic data generation methods eliminate missing values as part of their pre-processing steps and thus completely ignore this valuable source of information. Instead, we propose methods to generate synthetic data that preserve both the observable and missing data distributions, thereby retaining the valuable information encoded in the missing patterns of the real data. Our approach handles various missing data scenarios and can easily integrate with existing data generation methods. Extensive empirical evaluations on diverse datasets demonstrate the effectiveness of our approach as well as the value of preserving the missing data distribution in synthetic data.
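The abstract does not spell out the proposed methods, so the following is only a minimal illustrative sketch of the general idea of carrying missing-data patterns over into synthetic data, not the authors' algorithm. It assumes missing values are encoded as `None`, and every function name (`fit_mask_distribution`, `sample_masks`, `apply_masks`, `toy_generator`, `missing_rate`) is hypothetical. The sketch reproduces only the empirical frequency of each missingness pattern; it does not model any dependence between the values themselves and whether they are missing, which a fuller treatment of the various missing data scenarios would need to address.

```python
import random


def fit_mask_distribution(rows):
    """Count how often each missingness pattern (a tuple of booleans) occurs."""
    counts = {}
    for row in rows:
        mask = tuple(value is None for value in row)
        counts[mask] = counts.get(mask, 0) + 1
    return counts


def sample_masks(counts, n, rng):
    """Draw n missingness patterns in proportion to their observed frequency."""
    masks = list(counts)
    weights = [counts[m] for m in masks]
    return rng.choices(masks, weights=weights, k=n)


def apply_masks(complete_rows, masks):
    """Blank out the cells that each sampled pattern marks as missing."""
    return [
        tuple(None if missing else value for value, missing in zip(row, mask))
        for row, mask in zip(complete_rows, masks)
    ]


def toy_generator(rows, n, rng):
    """Hypothetical stand-in for an existing generator that emits complete rows.

    It bootstraps each column independently from its observed values, so it
    assumes every column has at least one observed value.
    """
    columns = list(zip(*rows))
    observed = [[v for v in col if v is not None] for col in columns]
    return [tuple(rng.choice(obs) for obs in observed) for _ in range(n)]


def missing_rate(rows, col):
    """Fraction of rows whose value in column `col` is missing."""
    return sum(row[col] is None for row in rows) / len(rows)


# Made-up toy records (age, income); income is missing in half of them.
real = [(34, 52000), (51, None), (None, 61000), (29, None), (45, 48000), (62, None)]

rng = random.Random(0)
counts = fit_mask_distribution(real)
complete = toy_generator(real, 1000, rng)
synthetic = apply_masks(complete, sample_masks(counts, 1000, rng))
```

Because the masks are sampled separately and applied to complete rows, a step of this kind could sit behind any generator that outputs complete records. Comparing `missing_rate(real, 1)` with `missing_rate(synthetic, 1)` gives a quick check that the per-column missing rates are close.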