

A Framework for Generating Realistic Synthetic Tabular Data in a Randomized Controlled Trial Setting.

Author information

Petrakos Niki Z, Moodie Erica E M, Savy Nicolas

Affiliations

Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Québec, Canada.

Institut de Mathématiques de Toulouse; UMR5219, Université de Toulouse; CNRS, UT2J, Toulouse, France.

Publication information

Stat Med. 2025 Aug;44(18-19):e70227. doi: 10.1002/sim.70227.

Abstract

Generation of realistic synthetic data has garnered considerable attention in recent years, particularly in the health research domain due to its utility in, for instance, sharing data while protecting patient privacy or determining optimal clinical trial design. While much work has been concentrated on synthetic image generation, generation of realistic and complex synthetic tabular data of the type most commonly encountered in classic epidemiological or clinical studies is still lacking, especially with regard to generating data for randomized controlled trials (RCTs). There is no consensus regarding the best way to generate synthetic tabular RCT data such that the underlying multivariate data distribution is preserved. Motivated by an RCT in the treatment of Human Immunodeficiency Virus, we empirically compared the ability of several strategies and three generation techniques (two machine learning, the other a more classical statistical method) to faithfully reproduce realistic data. Our results suggest that using a sequential generation approach with an R-vine copula model to generate baseline variables, followed by a simple random treatment allocation to mimic the RCT setting, and subsequent regression models for variables post-treatment allocation (such as the trial outcome) is the most effective way to generate synthetic tabular RCT data that capture important and realistic features of the real data.
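The sequential strategy described in the abstract (baseline variables first, then random allocation, then a regression model for the outcome) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: a Gaussian copula on empirical margins stands in for the R-vine copula, the outcome is assumed to be continuous and modelled by ordinary least squares, allocation is assumed to be 1:1, and all function and variable names (`fit_gaussian_copula`, `sample_baseline`, `generate_synthetic_rct`, `X_real`, `A_real`, `Y_real`) are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()  # standard normal, used for score transforms


def fit_gaussian_copula(X):
    """Estimate the correlation of normal scores.

    ASSUMPTION: a Gaussian copula is used here as a simple stand-in;
    the abstract's preferred model is an R-vine copula.
    """
    n = X.shape[0]
    # Rank-transform each column to pseudo-observations in (0, 1)
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    U = ranks / (n + 1)
    Z = np.vectorize(_norm.inv_cdf)(U)
    return np.corrcoef(Z, rowvar=False)


def sample_baseline(X, corr, n_new, rng):
    """Draw correlated normals and map them back through empirical quantiles."""
    p = X.shape[1]
    Z = rng.multivariate_normal(np.zeros(p), corr, size=n_new)
    U = np.vectorize(_norm.cdf)(Z)
    return np.column_stack([np.quantile(X[:, j], U[:, j]) for j in range(p)])


def generate_synthetic_rct(X_real, A_real, Y_real, n_new, seed=0):
    """Sequentially generate baseline covariates, treatment arm, and outcome."""
    rng = np.random.default_rng(seed)

    # Step 1: synthetic baseline variables from the fitted copula
    corr = fit_gaussian_copula(X_real)
    X_syn = sample_baseline(X_real, corr, n_new, rng)

    # Step 2: simple random treatment allocation (ASSUMPTION: probability 1/2)
    A_syn = rng.integers(0, 2, size=n_new)

    # Step 3: outcome regression Y ~ 1 + A + X fitted on the real data
    # (ASSUMPTION: continuous outcome, linear model, Gaussian residuals)
    D_real = np.column_stack([np.ones(len(Y_real)), A_real, X_real])
    beta, *_ = np.linalg.lstsq(D_real, Y_real, rcond=None)
    resid_sd = np.std(Y_real - D_real @ beta, ddof=D_real.shape[1])

    D_syn = np.column_stack([np.ones(n_new), A_syn, X_syn])
    Y_syn = D_syn @ beta + rng.normal(0.0, resid_sd, size=n_new)
    return X_syn, A_syn, Y_syn
```

Because the treatment arm is drawn independently of the synthetic baseline variables, the generated data keep the defining property of an RCT, namely that allocation is unrelated to baseline covariates, while the outcome model carries over the treatment and covariate effects estimated from the real trial.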


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bd86/12345405/0afce29a0de1/SIM-44-0-g007.jpg
