Kang Ha Ye Jin, Ko Minsam, Ryu Kwang Sun
Department of Applied Artificial Intelligence, Hanyang University, Seoul, Republic of Korea.
Department of Public Health & AI, Graduate School of Cancer Science and Policy, National Cancer Center, Goyang, Republic of Korea.
Sci Rep. 2025 Mar 25;15(1):10254. doi: 10.1038/s41598-025-93077-3.
In healthcare, the most common type of data is tabular data, which holds high significance and potential for medical AI. However, privacy concerns have hindered its widespread use. Although synthetic data has emerged as a viable solution, generating healthcare tabular data (HTD) is difficult because the variables within each record are extensively interdependent and cover diverse clinical characteristics, including sensitive information. To overcome these issues, this study proposed a tabular transformer generative adversarial network (TT-GAN) that generates synthetic data while accounting for the relationships between variables that may be present in an HTD dataset. Transformers can model the relationships between the columns of each record using a multi-attention mechanism. In addition, to address the risk of sensitive patient information being restored from the synthetic data, the Transformer was embedded in a generative adversarial network (GAN) architecture to ensure an implicit generative algorithm. To handle the heterogeneous characteristics of the continuous variables in the HTD dataset, a discretization and converter methodology was applied. The experimental results confirmed that the TT-GAN outperformed the Conditional Tabular GAN (CTGAN) and copula GAN. Discretization and converters proved effective with the proposed Transformer algorithm, whereas Transformer-based models without discretization and converters performed significantly worse. For the CTGAN and copula GAN, the discretization and converter methodology had minimal effect. Thus, the TT-GAN showed considerable potential in healthcare, demonstrating its ability to generate artificial data that closely resemble real healthcare datasets. The ability of the algorithm to efficiently handle mixed variable types, including polynomial, discrete, and continuous variables, demonstrated its versatility and practicality in healthcare research and data synthesis.
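The abstract does not specify how discretization and the converter are implemented. As a minimal illustrative sketch only, and not the authors' method, the following Python example assumes quantile binning for discretization and uniform sampling within a bin for the conversion back to continuous values. The function names (`fit_bins`, `discretize`, `convert_back`), the bin count, and the simulated `ages` column are all hypothetical.

```python
import numpy as np

def fit_bins(values, n_bins=10):
    """Compute quantile-based bin edges for one continuous column (assumed scheme)."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(values, quantiles)
    return np.unique(edges)  # drop duplicate edges caused by tied values

def discretize(values, edges):
    """Map each continuous value to a bin index in [0, len(edges) - 2]."""
    idx = np.searchsorted(edges, values, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)

def convert_back(bin_ids, edges, rng):
    """Hypothetical converter: sample a value uniformly inside each bin."""
    low = edges[bin_ids]
    high = edges[bin_ids + 1]
    return rng.uniform(low, high)

rng = np.random.default_rng(0)
ages = rng.normal(60, 12, size=1000)        # simulated continuous column, not real data
edges = fit_bins(ages, n_bins=10)
tokens = discretize(ages, edges)            # discrete tokens a generator could model
restored = convert_back(tokens, edges, rng) # continuous values recovered from tokens
```

Under this assumed scheme, a generator only has to produce categorical bin indices for every column, and the converter maps generated indices back to plausible continuous values; the restored values match the originals only up to bin width.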