Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets.
Affiliations
CHEO Research Institute, Ottawa, ON, Canada.
Replica Analytics Ltd, Ottawa, ON, Canada.
Publication information
JCO Clin Cancer Inform. 2023 Sep;7:e2300116. doi: 10.1200/CCI.23.00116.
PURPOSE
There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is little evidence supporting this claim for oncology clinical trial data sets, or on the utility of privacy-preserving synthetic data. The objective of this study is to evaluate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques.
METHODS
We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk.
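The concordance of CIs between real and synthetic data can be illustrated with a small sketch. The abstract does not state which overlap formula was used, so the definition below is an assumption: it is one commonly used interval overlap measure that averages the shared length relative to each interval's own width. The function name `ci_overlap` and the example intervals are hypothetical.

```python
def ci_overlap(real_ci, synth_ci):
    """Interval overlap between two confidence intervals.

    Assumed definition (not confirmed by the abstract): the length of the
    intersection divided by each interval's width, averaged over the two.
    Returns 1.0 for identical intervals and a negative value when the
    intervals do not intersect.
    """
    (l_r, u_r), (l_s, u_s) = real_ci, synth_ci
    if u_r <= l_r or u_s <= l_s:
        raise ValueError("each interval needs upper > lower")
    # Signed length of the shared region; negative if the intervals are disjoint.
    inter = min(u_r, u_s) - max(l_r, l_s)
    return 0.5 * (inter / (u_r - l_r) + inter / (u_s - l_s))


# Hypothetical intervals for an effect estimate from real and synthetic data.
real = (0.5, 1.5)
synthetic = (1.0, 2.0)
print(ci_overlap(real, synthetic))
```

Under this definition, values closer to 1 indicate that the synthetic analysis reproduces the uncertainty range of the real analysis more closely.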
RESULTS
Utility was highest with the sequential synthesis method, for which all results were replicable and the CI overlap was most similar or higher for seven of eight data sets. Both types of privacy risk were low across all three types of generative models.
DISCUSSION
Synthetic data generated using sequential synthesis methods can act as a proxy for real clinical trial data sets while carrying low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.