Alqulaity Malak, Yang Po
Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK.
Sensors (Basel). 2024 Nov 30;24(23):7673. doi: 10.3390/s24237673.
The generation of synthetic tabular data has emerged as a critical task in various fields, particularly in healthcare, where data privacy concerns limit the availability of real datasets for research and analysis. This paper presents an enhanced Conditional Generative Adversarial Network (GAN) architecture designed for generating high-quality synthetic tabular data, with a focus on cardiovascular disease datasets that encompass mixed data types and complex feature relationships. The proposed architecture employs specialized sub-networks to process continuous and categorical variables separately, leveraging metadata such as Gaussian Mixture Model (GMM) parameters for continuous attributes and embedding layers for categorical features. By integrating these specialized pathways, the generator produces synthetic samples that closely mimic the statistical properties of the real data. Comprehensive experiments were conducted to compare the proposed architecture with two established models: Conditional Tabular GAN (CTGAN) and Tabular Variational AutoEncoder (TVAE). The evaluation used the Kolmogorov-Smirnov (KS) test for continuous variables, the Jaccard coefficient for categorical variables, and pairwise correlation analyses. Results indicate that the proposed approach attains a mean KS statistic of 0.3900, which outperforms CTGAN (0.4803) and is comparable to TVAE (0.3858). Notably, our approach achieves the lowest KS statistics for key continuous features, such as total cholesterol (KS = 0.0779), weight (KS = 0.0861), and diastolic blood pressure (KS = 0.0957), indicating its effectiveness in closely replicating real data distributions. Additionally, it achieved a Jaccard coefficient of 1.00 for eight out of eleven categorical variables, effectively preserving categorical distributions. These findings indicate that the proposed architecture captures both distributions and dependencies, providing a robust solution for supporting mobile personalized cardiovascular disease prevention systems.
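As an illustration of the two per-column metrics named above, the following is a minimal sketch and not the authors' evaluation code. It assumes the KS statistic is the two-sample form (the largest gap between the empirical CDFs of a real and a synthetic column) and that the Jaccard coefficient is taken over the sets of distinct categories observed in each column; the abstract does not state either detail. The function names and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs.

    0.0 means the empirical distributions coincide; 1.0 means they do not
    overlap at all. Lower is better for synthetic data.
    """
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The supremum of |F_a - F_b| is reached at one of the pooled sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def jaccard_coefficient(real, synthetic):
    """Jaccard coefficient over distinct categories (an assumed definition).

    1.0 means the synthetic column contains exactly the categories
    found in the real column.
    """
    r, s = set(real), set(synthetic)
    union = r | s
    return len(r & s) / len(union) if union else 1.0


if __name__ == "__main__":
    # Hypothetical values, for illustration only (not data from the paper).
    real_cholesterol = [180, 195, 210, 225, 240]
    synth_cholesterol = [182, 198, 215, 230, 300]
    print(ks_statistic(real_cholesterol, synth_cholesterol))

    real_smoker = ["yes", "no", "no", "yes"]
    synth_smoker = ["no", "no", "yes", "no"]
    print(jaccard_coefficient(real_smoker, synth_smoker))
```

Under this set-based reading, a Jaccard coefficient of 1.00 shows only that every real category also appears in the synthetic column and no new ones were introduced; it does not compare category frequencies.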