Kang Ha Ye Jin, Batbaatar Erdenebileg, Choi Dong-Woo, Choi Kui Son, Ko Minsam, Ryu Kwang Sun
Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea.
Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea.
JMIR Med Inform. 2023 Nov 24;11:e47859. doi: 10.2196/47859.
Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but generating synthetic tabular data (STD) that preserve the logical relationships present in the original data remains challenging. Filtering methods for SDG can lead to the loss of important information.
This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships.
The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) 2 different partitioning methods were applied (the class-specific criterion distinguished between survival and death groups, while the Cramér V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single data set. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performance of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models.
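The 3 steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (cramers_v, most_associated_pair, partition, dc_generate) are hypothetical, Cramér V is computed without bias correction, "balanced" is assumed to mean that every subset contributes as many rows as the largest subset, and placeholder_generator simply resamples rows in place of a trained conditional tabular or copula GAN.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (no bias correction)."""
    n = len(xs)
    row, col = Counter(xs), Counter(ys)
    k = min(len(row), len(col))
    if n == 0 or k < 2:
        return 0.0  # association is undefined for a constant column
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def most_associated_pair(rows, columns):
    """Return the pair of columns with the highest Cramér's V."""
    return max(
        combinations(columns, 2),
        key=lambda p: cramers_v([r[p[0]] for r in rows], [r[p[1]] for r in rows]),
    )


def partition(rows, key):
    """Step 2a: divide the data set into subsets by the value of one column."""
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[key]].append(r)
    return dict(subsets)


def placeholder_generator(subset, n, rng):
    """Stand-in for a per-subset GAN: resamples rows with replacement."""
    return [dict(rng.choice(subset)) for _ in range(n)]


def dc_generate(rows, key, generator, rng, balanced=False):
    """Steps 2b and 3: generate per subset, then merge into one data set."""
    subsets = partition(rows, key)
    largest = max(len(s) for s in subsets.values())
    synthetic = []
    for subset in subsets.values():
        n = largest if balanced else len(subset)
        synthetic.extend(generator(subset, n, rng))
    return synthetic
```

Because each generator only ever sees rows from one subset, a rule that holds within every subset (for example, a hypothetical "cause of death is empty for survivors") cannot be broken by mixing rows across subsets, which is the intuition behind partitioning before generation rather than filtering afterward.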
For all 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes), the synthetic data generated by our proposed model gave the better results with all 4 classifiers (DT, RF, XGBoost, and LGBM). The CS- versus DC-based model performances were compared using mean area under the curve values with SD. For DT: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes. For RF: 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes. For XGBoost: 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes. For LGBM: 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes. In addition, balanced synthetic data performed better than imbalanced synthetic data.
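For readers unfamiliar with the "mean (SD)" notation used for these results, the following sketch shows one way such a summary could be produced from repeated runs. It is illustrative only: the function names roc_auc and summarize are hypothetical, the area under the curve is computed with the pairwise (Mann-Whitney) formulation, and the use of the sample SD and of a percentage scale is an assumption, since the abstract does not state how the SD was computed.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    # A positive scored above a negative counts 1; a tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Mean and sample SD of per-run AUC values, expressed as percentages."""
    return round(100 * mean(aucs), 2), round(100 * stdev(aucs), 2)
```

Calling summarize on the list of AUC values from repeated training runs of one classifier returns a (mean, SD) pair in the same form as the figures reported above.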
This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.