Hogenboom Joshi, Lobo Gomes Aiara, Dekker Andre, Van Der Graaf Winette, Husson Olga, Wee Leonard
Department of Radiation Oncology (Maastro), GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, the Netherlands.
Department of Medical Oncology, Netherlands Cancer Institute, Amsterdam, the Netherlands.
JCO Clin Cancer Inform. 2024 Dec;8:e2400056. doi: 10.1200/CCI.24.00056. Epub 2024 Dec 3.
Research on rare diseases and atypical health care demographics is often slowed by high interparticipant heterogeneity and overall scarcity of data. Synthetic data (SD) have been proposed as a means of data sharing, enlargement, and diversification, by artificially generating data that reproduce real phenomena while obscuring the real patient data. The utility of SD is actively scrutinized in health care research, but the role of sample size in the actionability of SD is insufficiently explored. We aim to understand the interplay of actionability and sample size by generating SD sets of varying sizes from gradually diminishing amounts of real individuals' data. We evaluate the actionability of SD in a highly heterogeneous and rare demographic: adolescents and young adults (AYAs) with cancer.
A population-based cross-sectional cohort study of 3,735 AYAs was subsampled at random to produce 13 training data sets of varying sample sizes. We studied four distinct generator architectures built on the open-source Synthetic Data Vault library. Each architecture was used to generate SD of varying sizes from each of the aforementioned training subsets. SD actionability was assessed by comparing the resulting SD with their respective real data against three metrics: veracity, utility, and privacy concealment.
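The subsampling design described above can be sketched as a grid of (training subset, SD size) runs. This is a minimal illustration only, not the authors' pipeline: the abstract does not state the 13 training sizes or the SD sizes, so `TRAIN_SIZES`, `SD_MULTIPLIERS`, and the helper `build_design` are hypothetical, and the step that fits a Synthetic Data Vault generator on each subset is deliberately left out.

```python
import random

def build_design(n_real, train_sizes, sd_multipliers, seed=0):
    """Return one entry per (training subset, SD size) combination.

    Hypothetical helper: each training subset is a random draw, without
    replacement, of participant indices from the full real cohort.
    """
    rng = random.Random(seed)
    all_ids = list(range(n_real))
    design = []
    for n_train in train_sizes:
        # Random subsample of real participants for this training set.
        train_ids = rng.sample(all_ids, n_train)
        for multiplier in sd_multipliers:
            # SD size expressed relative to the training set size.
            sd_size = max(1, round(n_train * multiplier))
            design.append(
                {"n_train": n_train, "train_ids": train_ids, "sd_size": sd_size}
            )
    return design

# Assumed values for illustration; the study used 13 training sizes
# that are not listed in the abstract.
N_REAL = 3735
TRAIN_SIZES = [3735, 2000, 1000, 500, 250]
SD_MULTIPLIERS = [0.5, 1, 2, 10]

design = build_design(N_REAL, TRAIN_SIZES, SD_MULTIPLIERS)

for run in design:
    # A generator would be fitted on the rows in run["train_ids"] here and
    # asked for run["sd_size"] synthetic rows; that step is omitted.
    print(run["n_train"], run["sd_size"])
```

Expressing SD size as a multiple of the training size is one possible design choice; it makes "SD with sizes similar to the real data" correspond to a multiplier of 1.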
All examined generator architectures yielded actionable data when generating SD with sizes similar to the real data. Larger SD sample sizes increased veracity but generally also increased privacy risks. Using fewer training participants led to faster convergence in veracity, but partially exacerbated privacy concealment issues.
SD are a potentially promising option for data sharing and data augmentation, yet sample size plays a significant role in their actionability. SD generation should go hand-in-hand with consistent scrutiny, and sample size should be carefully considered in this process.