AI-driven synthetic data generation for accelerating hepatology research: A study of the United Network for Organ Sharing (UNOS) database.

Authors

Ahn Joseph C, Noh Yung-Kyun, Hu Mingzhao, Shen Xiaotong, Simonetto Douglas A, Kamath Patrick S, Loomba Rohit, Shah Vijay H

Affiliations

Division of Gastroenterology and Hepatology, Department of Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA.

Department of Computer Science, Hanyang University, Seoul, South Korea.

Publication information

Hepatology. 2025 Mar 11. doi: 10.1097/HEP.0000000000001299.

DOI:10.1097/HEP.0000000000001299

PMID:40067682

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12353439/

Abstract

BACKGROUND AND AIMS

Clinical hepatology research often faces limited data availability, underrepresentation of minority groups, and complex data-sharing regulations. Synthetic data (artificially generated patient records designed to mirror real-world distributions) offers a potential solution. We hypothesized that diffusion models, a state-of-the-art generative technique, could produce synthetic liver transplant waitlist data from the United Network for Organ Sharing database that maintains statistical fidelity, replicates clinical correlations and survival patterns, and ensures robust privacy protection.

APPROACH AND RESULTS

Diffusion models were used to generate synthetic patient cohorts mirroring the United Network for Organ Sharing liver transplant waitlist database between the years 2019 and 2023. Statistical fidelity was assessed using maximum mean discrepancy (MMD) and Wasserstein distance, correlation analysis, and variable-level metrics. Clinical utility was evaluated by comparing transplant-free survival via Kaplan-Meier curves and the MELD score performance. Privacy was quantified using the Distance to Closest Record (DCR) and attribute disclosure risk assessments. The synthetic dataset was nearly indistinguishable from the original dataset (MMD=0.002, standardized Wasserstein distance <1.0), preserving clinically relevant correlations and survival patterns as evidenced by similar median survival times (110 vs. 101 days) and 5-year survival rates (22.2% vs. 22.8%). MELD-based 90-day mortality prediction was maintained (original AUC=0.839 vs. synthetic AUC=0.844). Privacy metrics indicated no identifiable patient matches, and mean DCR values ensured that synthetic individuals were not direct replicas of real patients.
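
As an illustration of two of the metrics named above, the following minimal sketch computes a squared MMD estimate and the Distance to Closest Record on toy data. It is not the authors' pipeline. The abstract does not state which kernel, bandwidth, estimator, or distance the study used, so the RBF kernel, the bandwidth of 1.0, the biased estimator, the Euclidean distance, and the random two-column "cohorts" are all assumptions chosen for illustration. The function names are hypothetical, and the code uses only the Python standard library.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel; the kernel choice is an assumption."""
    return math.exp(-sq_dist(a, b) / (2.0 * bandwidth ** 2))


def mmd_squared(real, synth, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared maximum mean discrepancy."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(real, real)
            + mean_kernel(synth, synth)
            - 2.0 * mean_kernel(real, synth))


def distance_to_closest_record(real, synth):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.sqrt(sq_dist(s, r)) for r in real) for s in synth]


# Toy stand-ins for standardized numeric features (NOT UNOS data).
rng = random.Random(0)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
synth = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]    # same distribution
shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(200)]  # different distribution

mmd_same = mmd_squared(real, synth)
mmd_shifted = mmd_squared(real, shifted)

dcr = distance_to_closest_record(real, synth)
mean_dcr = sum(dcr) / len(dcr)
exact_copies = sum(1 for d in dcr if d == 0.0)

print(f"MMD^2 (same distribution):    {mmd_same:.4f}")
print(f"MMD^2 (shifted distribution): {mmd_shifted:.4f}")
print(f"mean DCR: {mean_dcr:.4f}, exact copies: {exact_copies}")
```

In this sketch, a squared MMD near zero indicates that the two samples are hard to tell apart under the chosen kernel, while a DCR of exactly zero would flag a synthetic row that duplicates a real one. On real tabular data, features would typically need to be standardized and categorical variables encoded before either metric is meaningful.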

CONCLUSION

Artificial intelligence-generated synthetic data derived from diffusion models can faithfully replicate complex hepatology datasets, maintain key clinical signals, and ensure strong privacy safeguards. This approach can help address data scarcity, enhance model generalizability, foster multi-institutional collaboration, and accelerate progress in hepatology research.
