Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy.

Authors

Kang Ha Ye Jin, Batbaatar Erdenebileg, Choi Dong-Woo, Choi Kui Son, Ko Minsam, Ryu Kwang Sun

Affiliations

Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea.

Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea.

Publication information

JMIR Med Inform. 2023 Nov 24;11:e47859. doi: 10.2196/47859.

DOI: 10.2196/47859
PMID: 37999942
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10709788/

Abstract

BACKGROUND

Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but preserving logical relationships among variables in synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information.

OBJECTIVE

This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships.

METHODS

The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) we used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramér V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network (CTGAN) and the copula generative adversarial network (CopulaGAN) to generate synthetic data; and (3) the generated synthetic data were consolidated into a single data set. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models.
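The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy records, the column names, and the helpers `most_associated_pair`, `partition`, `dc_generate`, and `violates_rule` are hypothetical, and `marginal_generator` is a deliberately naive placeholder standing in for a trained CTGAN or CopulaGAN model. How exactly the study turns the Cramér V criterion into a split is not stated in the abstract, so the pair-selection helper is a guess at one plausible use.

```python
import itertools
import math
import random
from collections import Counter, defaultdict


def cramers_v(xs, ys):
    """Cramér's V between two categorical columns (no bias correction)."""
    n = len(xs)
    joint, row_tot, col_tot = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, rx in row_tot.items():
        for y, cy in col_tot.items():
            expected = rx * cy / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def most_associated_pair(rows):
    """Hypothetical helper: the column pair with the highest Cramér's V."""
    columns = list(rows[0])

    def v(pair):
        a, b = pair
        return cramers_v([r[a] for r in rows], [r[b] for r in rows])

    return max(itertools.combinations(columns, 2), key=v)


def partition(rows, column):
    """Divide: one subset per distinct value of `column`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row)
    return dict(subsets)


def marginal_generator(subset, n, rng):
    """Placeholder generator, NOT a GAN: samples each column independently."""
    columns = list(subset[0])
    return [{c: rng.choice(subset)[c] for c in columns} for _ in range(n)]


def dc_generate(rows, column, n_per_subset, rng, generator=marginal_generator):
    """Conquer and merge: generate per subset, then pool the results."""
    synthetic = []
    for subset in partition(rows, column).values():
        synthetic.extend(generator(subset, n_per_subset, rng))
    rng.shuffle(synthetic)
    return synthetic


def violates_rule(row):
    """Toy logical rule: a cause of death is recorded only for deceased patients."""
    return (row["status"] == "alive") != (row["cause_of_death"] == "none")


# Hypothetical toy records; the study's data sets are not reproduced here.
ROWS = (
    [{"status": "alive", "cause_of_death": "none", "stage": s}
     for s in "I I II II III I".split()]
    + [{"status": "dead", "cause_of_death": c, "stage": s}
       for c, s in [("cancer", "III"), ("cancer", "III"), ("other", "II"), ("other", "III")]]
)

rng = random.Random(0)
pooled = marginal_generator(ROWS, 200, rng)      # one generator over all rows
divided = dc_generate(ROWS, "status", 100, rng)  # class-specific split, 100 rows per class
print(most_associated_pair(ROWS))
print("rule violations (pooled, divided):",
      sum(map(violates_rule, pooled)), sum(map(violates_rule, divided)))
```

Because each subset contains only rows that already satisfy the rule, a generator fitted per subset cannot mix values across groups, so the divided output has no violations, whereas the pooled output can pair "alive" with a cause of death. Requesting the same `n_per_subset` for every class also yields a balanced synthetic data set; passing class-proportional counts instead would reproduce the original imbalance.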

RESULTS

For all 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes), the synthetic data generated by our proposed model yielded better performance across the 4 classifiers (DT, RF, XGBoost, and LGBM). The CS- versus DC-based model performances were compared using mean area under the curve (AUC) values with SDs:

DT: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes.
RF: 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes.
XGBoost: 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes.
LGBM: 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes.

In addition, we found that balanced synthetic data performed better.
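As a reading aid for the "mean AUC (SD)" figures, the following sketch shows one way such a summary could be computed from repeated runs. It assumes the reported values are AUCs expressed as percentages and that the SD is the sample SD over runs; the abstract states neither, and the function names `auc` and `summarize` are illustrative only.

```python
import statistics


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Mean and sample SD of per-run AUCs, scaled to percent (assumed convention)."""
    pct = [100 * a for a in aucs]
    return statistics.mean(pct), statistics.stdev(pct)


# Made-up labels and scores for a single run, purely for illustration.
single_run = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(single_run)
```

In practice each run would train a classifier on one synthetic data set and score it on held-out real data; `summarize` would then be applied to the list of per-run AUCs.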

CONCLUSIONS

This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1cb/10709788/e8c2464da41d/medinform_v11i1e47859_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1cb/10709788/78a64bd3ffe7/medinform_v11i1e47859_fig2.jpg
