Eckardt Jan-Niklas, Hahn Waldemar, Röllig Christoph, Stasik Sebastian, Platzbecker Uwe, Müller-Tidow Carsten, Serve Hubert, Baldus Claudia D, Schliemann Christoph, Schäfer-Eckart Kerstin, Hanoun Maher, Kaufmann Martin, Burchert Andreas, Thiede Christian, Schetelig Johannes, Sedlmayr Martin, Bornhäuser Martin, Wolfien Markus, Middeke Jan Moritz
Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany.
Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany.
NPJ Digit Med. 2024 Mar 20;7(1):76. doi: 10.1038/s41746-024-01076-x.
Clinical research relies on high-quality patient data; however, obtaining large data sets is costly, and access to existing data is often hindered by privacy and regulatory concerns. Synthetic data generation holds the promise of bypassing these barriers, simplifying data access and opening the prospect of synthetic control cohorts. We employed two generative artificial intelligence methods, CTAB-GAN+ and normalizing flows (NFlow), to synthesize patient data derived from 1606 patients with acute myeloid leukemia, a heterogeneous hematological malignancy, who were treated within four multicenter clinical trials. Both generative models accurately captured the distributions of demographic, laboratory, molecular and cytogenetic variables, as well as patient outcomes, yielding high performance scores for fidelity and usability in both synthetic cohorts (n = 1606 each). Survival analysis demonstrated close resemblance of survival curves between the original and synthetic cohorts. Inter-variable relationships were preserved in univariable outcome analysis, enabling explorative analysis in our synthetic data. Additionally, training sample privacy is safeguarded, mitigating possible patient re-identification, a risk we quantified using Hamming distances. We provide not only a proof-of-concept for synthetic data generation in multimodal clinical data for rare diseases, but also full public access to the synthetic data sets to foster further research.
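The abstract does not describe how the Hamming distances were computed, so the following is only a minimal sketch of one common way to use them as a privacy check. It assumes each patient record is encoded as an equal-length vector of categorical values, and it reports, for every synthetic record, the distance to its closest original record. The function names (`hamming_distance`, `nearest_original_distances`) and the toy records are hypothetical illustrations and are not taken from the study.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return sum(x != y for x, y in zip(a, b))


def nearest_original_distances(synthetic, original):
    """For each synthetic record, return the Hamming distance to the
    closest original record. A distance of 0 means the synthetic record
    is an exact copy of some original record."""
    return [
        min(hamming_distance(s, o) for o in original)
        for s in synthetic
    ]


# Toy binary-encoded records (invented for illustration only).
original = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
synthetic = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

distances = nearest_original_distances(synthetic, original)
exact_copies = sum(d == 0 for d in distances)
print(distances, exact_copies)
```

Under this reading, larger minimum distances suggest that synthetic records are not near-copies of training patients, while a count of exact copies above zero would flag records that deserve closer inspection.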