Miletic Marko, Sariyar Murat
Institute for Optimisation and Data Analysis (IODA), Bern University of Applied Sciences, Biel, Switzerland.
JMIR AI. 2025 Mar 20;4:e65729. doi: 10.2196/65729.
Recent advances in generative adversarial networks (GANs) and large language models (LLMs) have substantially improved the synthesis and augmentation of medical data. These and other deep learning-based methods show promise for generating the high-quality, realistic datasets needed to improve machine learning applications in health care, particularly where data privacy and availability are limiting factors. However, challenges remain in accurately capturing the complex associations inherent in medical datasets.
This study evaluates the effectiveness of various synthetic data generation (SDG) methods in replicating the correlation structures inherent in real medical datasets. In addition, it examines their performance in downstream tasks using random forests (RFs) as the benchmark model. To provide a comprehensive analysis, alternative models such as eXtreme Gradient Boosting and Gated Additive Tree Ensembles are also considered. We compare the following SDG approaches: Synthetic Populations in R (synthpop), copula, copulagan, Conditional Tabular Generative Adversarial Network (ctgan), tabular variational autoencoder (tvae), and tabula, an LLM-based approach.
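To make the statistical baselines concrete, the sketch below shows a minimal Gaussian-copula synthesizer for numeric tabular data: rank-transform each column to standard normal, estimate the latent correlation matrix, sample from it, and map back through the empirical quantiles. This is an illustrative from-scratch sketch of the general copula idea, not the implementation the study benchmarked; the function name and details are assumptions.

```python
import numpy as np
from scipy import stats

def fit_sample_gaussian_copula(data, n_samples, seed=0):
    """Minimal Gaussian-copula synthesizer sketch (illustrative, not the
    benchmarked implementation). Assumes `data` is a 2-D numeric array."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Probability-integral transform via ranks; the n+1 denominator avoids 0 and 1
    u = stats.rankdata(data, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    # Latent correlation structure of the rank-transformed data
    corr = np.corrcoef(z, rowvar=False)
    # Sample latent Gaussians with the estimated correlation
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    # Invert each marginal with empirical quantiles of the real column
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
    )
```

The empirical-quantile inversion is one source of the integer-variable limitation noted in the conclusions: quantiles of an integer column need not be integers unless rounding or a discrete marginal model is added.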
We evaluated synthetic data generation methods using both real-world and simulated datasets. Simulated data consisted of 10 Gaussian variables and one binary target variable with varying correlation structures, generated via Cholesky decomposition. Real-world datasets included the body performance dataset (13,393 samples; fitness classification), the Wisconsin Breast Cancer dataset (569 samples; tumor diagnosis), and the diabetes dataset (768 samples; diabetes prediction). Data quality was evaluated by comparing correlation matrices, by the propensity score mean-squared error (pMSE) as a general utility metric, and by F-scores on downstream tasks as a specific utility metric, training on synthetic data and testing on real data.
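The Cholesky-based simulation described above can be sketched as follows. The equicorrelation structure and the logistic link for the binary target are illustrative assumptions; the study's actual correlation designs and target mechanism may differ.

```python
import numpy as np

def simulate_dataset(n, rho, n_features=10, seed=0):
    """Correlated Gaussian features via Cholesky decomposition plus a
    binary target (sketch; equicorrelation and logistic link assumed)."""
    rng = np.random.default_rng(seed)
    # Equicorrelated matrix: rho off the diagonal, 1 on the diagonal
    corr = np.full((n_features, n_features), rho)
    np.fill_diagonal(corr, 1.0)
    L = np.linalg.cholesky(corr)      # corr = L @ L.T
    z = rng.standard_normal((n, n_features))
    x = z @ L.T                       # rows now carry the target correlation
    # Binary target drawn from a logistic model on the features
    logits = x @ rng.normal(size=n_features)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return x, y
```

Multiplying independent standard normals by the Cholesky factor imposes the desired covariance exactly in expectation, which is why the technique is a standard way to control correlation complexity in simulation studies.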
Our simulation study, supplemented with real-world data analyses, shows that the statistical methods copula and synthpop consistently outperform deep learning approaches across various sample sizes and correlation complexities, with synthpop being the most effective. Deep learning methods, including LLMs, show mixed performance, particularly with smaller datasets or limited training epochs. LLMs often struggle to replicate numerical dependencies effectively. In contrast, methods like tvae with 10,000 epochs perform comparably well. On the body performance dataset, copulagan achieves the best performance in terms of pMSE. The results also highlight that model utility depends more on the relative correlations between features and the target variable than on the absolute magnitude of differences between correlation matrices.
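The pMSE metric referenced here can be computed as sketched below: pool real and synthetic rows, train a classifier to tell them apart, and score the mean-squared deviation of its propensities from the synthetic share of the pool. A value near 0 means the samples are hard to distinguish. The logistic propensity model is one common choice and an assumption of this sketch, not necessarily the model used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real, synthetic, seed=0):
    """Propensity score mean-squared error sketch (logistic propensity
    model assumed). Lower values indicate higher general utility."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    # Expected propensity when real and synthetic are indistinguishable
    c = len(synthetic) / len(X)
    model = LogisticRegression(max_iter=1000, random_state=seed).fit(X, y)
    p = model.predict_proba(X)[:, 1]
    return float(np.mean((p - c) ** 2))
```

A well-matched synthetic sample drives every propensity toward `c` (0.5 for equal-sized samples), so pMSE approaches 0; an easily separable one pushes propensities toward 0 or 1, inflating the score.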
Statistical methods, particularly synthpop, demonstrate superior robustness and utility preservation for synthetic tabular data compared with deep learning approaches. Copula methods show potential but face limitations with integer variables. Deep learning methods underperform in this context. Overall, these findings underscore the dominance of statistical methods for synthetic tabular data generation, while highlighting the niche potential of deep learning approaches for highly complex datasets, provided adequate resources and tuning.