Bern University of Applied Sciences, Switzerland.
Stud Health Technol Inform. 2024 Aug 22;316:963-967. doi: 10.3233/SHTI240571.
Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets, both of which are essential for diagnostic and treatment advancements. Among the most promising generative approaches are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLMs from the Pythia LLM Scaling Suite, with model sizes ranging from 14M to 1B parameters, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks, which then make predictions on the real-world data. Our findings indicate that as the number of parameters increases, the LLMs outperform the reference GAN model; even the smallest 14M-parameter models perform comparably to GANs. Moreover, we observe a positive correlation between training dataset size and model performance. We discuss implications, challenges, and considerations for the real-world use of LLMs for synthetic tabular data generation.
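The evaluation protocol described above (train a random forest on synthetic data, then test it on real data, often called TSTR) can be sketched as follows. This is a minimal illustration only: the datasets are random placeholders standing in for the study's health records and generator output, and all parameter choices are assumptions, not the paper's configuration.

```python
# Minimal TSTR (train-on-synthetic, test-on-real) sketch: fit a random forest
# on synthetic tabular data and score it on real data. The data below are toy
# stand-ins, not the study's datasets; hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_table(n_rows: int, n_cols: int = 10):
    """Toy tabular classification data standing in for health records."""
    X = rng.normal(size=(n_rows, n_cols))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # simple label rule
    return X, y

# Stand-ins for generator output (e.g. LLM or CTGAN samples) and real data.
X_syn, y_syn = make_table(2000)
X_real, y_real = make_table(500)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_syn, y_syn)                                  # train on synthetic
tstr_accuracy = accuracy_score(y_real, clf.predict(X_real))  # test on real
print(f"TSTR accuracy: {tstr_accuracy:.3f}")
```

In the actual study, `X_syn` would come from a Pythia model or CTGAN, `X_real` from the held-out real dataset, and the resulting score would be compared across generator sizes.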