Gutsche Annika, Salameh Pascale, Jahandideh Samad S, Roodsaz Mehran, Kutan Serkan, Salehzadeh-Yazdi Ali, Kocatürk Emek, Gregoriou Stamatios, Thomsen Simon F, Kulthanan Kanokvalai, Tuchinda Papapit, Dissemond Joachim, Kasperska-Zajac Alicja, Zajac Magdalena, Zamłyński Mateusz, van Doorn Martijn, Parisi Claudio A S, Peter Jonny G, Day Cascia, McDougall Cathryn, Makris Michael, Fomina Daria, Kovalkova Elena, Streliaev Nikolai, Andrenova Gerelma, Lebedkina Marina, Khoskhkui Maryam, Aliabadi Mehraneh M, Bauer Andrea, Kiefer Lea, Muñoz Melba, Weller Karsten, Kolkhir Pavel, Metz Martin
Institute of Allergology, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Immunology and Allergology, Berlin, Germany.
Clin Transl Allergy. 2025 Aug;15(8):e70087. doi: 10.1002/clt2.70087.
Robust data are essential for clinical and epidemiological research, yet in chronic spontaneous urticaria (CSU), certain patient groups, such as the elderly or patients with comorbidities, are often underrepresented. In clinical trials, strict inclusion and exclusion criteria frequently limit recruitment, making it difficult to achieve sufficient statistical power. Similarly, real-world observational studies may lack the sample sizes needed for robust analysis. To address these limitations, we generated synthetic patient data that reflect these groups' clinical characteristics and variability. This approach enables more comprehensive analyses, facilitates hypothesis testing in otherwise inaccessible populations, and supports the generation of evidence where traditional data sources are insufficient.
A tree-based decision model was applied to generate synthetic data from an existing set of real-world data (RWD) from the Chronic Urticaria Registry (CURE). Descriptive characteristics and the strength of association between relevant RWD variables and their synthetic counterparts were analyzed as indicators of replication accuracy, showing how closely the synthetic data align with the RWD. Finally, we determined the minimum sample size required to generate high-quality synthetic data.
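The abstract does not name the specific algorithm or software, so the following is only a minimal sketch of one common form of tree-based synthesis (sequential, CART-style), not the authors' implementation. It assumes two invented variables, `age` and a toy `score`; the data, the function names (`build_tree`, `leaf_values`, `synthesize`) and the `min_leaf` parameter are all hypothetical. The first variable is resampled from its observed values, and the second is drawn from the real values stored in the tree leaf that matches the synthetic age.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, min_leaf=5):
    """Grow a regression tree on (x, y) pairs; leaves keep the observed y values."""
    best = None
    xs = sorted({x for x, _ in rows})
    # Try every midpoint between neighbouring distinct x values as a cut.
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [r for r in rows if r[0] <= cut]
        right = [r for r in rows if r[0] > cut]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, cut, left, right)
    if best is None:  # no admissible split: this node becomes a leaf
        return {"leaf": [y for _, y in rows]}
    _, cut, left, right = best
    return {
        "cut": cut,
        "left": build_tree(left, min_leaf),
        "right": build_tree(right, min_leaf),
    }


def leaf_values(tree, x):
    """Follow the cuts down to the leaf for x and return its observed y values."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["cut"] else tree["right"]
    return tree["leaf"]


def synthesize(real, n, rng, min_leaf=5):
    """Create n synthetic (x, y) records from the real (x, y) records."""
    tree = build_tree(real, min_leaf)
    # First variable: resample with replacement from the observed values.
    xs = [rng.choice(real)[0] for _ in range(n)]
    # Second variable: draw from the real values in the matching leaf.
    return [(x, rng.choice(leaf_values(tree, x))) for x in xs]


# Invented toy data (NOT registry data): age in years and an arbitrary score.
rng = random.Random(0)
real = []
for _ in range(200):
    age = rng.randint(18, 85)
    score = max(0, min(16, round(4 + 0.1 * age + rng.gauss(0, 2))))
    real.append((age, score))

# Generate four times as many synthetic records as there are real ones.
synthetic = synthesize(real, n=4 * len(real), rng=rng)
```

Because every synthetic value is drawn from observed values, the synthetic marginals stay within the range of the real data. The small `min_leaf` keeps each leaf's pool of donor values from becoming too small, a design choice made here for illustration only.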
The algorithm produced extensive synthetic data records that closely mirrored patient demographics and clinical characteristics of the disease. Smaller subgroups of the data were replicated equally well and followed the same distribution as the RWD. Known associations and correlations between disease-specific factors (disease control) and risk factors (age) yielded similar results in the synthetic data and the RWD, with no significant difference (p > 0.05). The smallest subsample of the RWD from which synthetic data could be generated with high accuracy was 25% of the original population, enabling a fourfold increase in the synthetic population.
Synthetic data could replicate RWD with reasonable accuracy for patients with CSU down to 25% of the original population size. This method has the potential to extend small patient subgroups in clinical and epidemiological research.