Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets.

Affiliations

CHEO Research Institute, Ottawa, ON, Canada.

Replica Analytics Ltd, Ottawa, ON, Canada.

Publication information

JCO Clin Cancer Inform. 2023 Sep;7:e2300116. doi: 10.1200/CCI.23.00116.

DOI:10.1200/CCI.23.00116

PMID:38011617

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10703127/

Abstract

PURPOSE

There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of the proposed study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques.

METHODS

We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk.
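The abstract does not give the formula used to assess concordance of CIs. As an illustration only, the sketch below assumes one common definition of CI overlap: the length of the intersection of the two intervals, expressed as a fraction of each interval's length and then averaged. The function name `ci_overlap`, the clamping of disjoint intervals to zero, and the example values are assumptions made here, not details taken from the study.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fractional overlap of two confidence intervals.

    Each argument is a (lower, upper) tuple. This is an assumed,
    commonly used definition, not necessarily the study's metric.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    if hi_r <= lo_r or hi_s <= lo_s:
        raise ValueError("each interval must satisfy lower < upper")

    # Length of the intersection; clamp to 0 when the intervals are disjoint
    # (an assumption: some definitions allow negative values instead).
    shared = max(0.0, min(hi_r, hi_s) - max(lo_r, lo_s))

    # Average the shared length as a fraction of each interval's width.
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


if __name__ == "__main__":
    # Hypothetical hazard-ratio CIs, for illustration only.
    real = (0.60, 0.90)
    synthetic = (0.65, 0.95)
    print(round(ci_overlap(real, synthetic), 3))
```

Under this definition a value of 1 means the two intervals coincide and 0 means they do not intersect, so higher values indicate closer agreement between the real and synthetic analyses.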

RESULTS

Utility was highest with the sequential synthesis method, for which all results were replicable and the CI overlap was most similar or higher for seven of eight data sets. Both types of privacy risks were low across all three types of generative models.

DISCUSSION

Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.

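Abstract (translation)

PURPOSE

Patients, researchers, the pharmaceutical industry, medical journal editors, research funders, and regulators have all shown strong interest in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. Some have argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. Little evidence supports this approach for oncology clinical trial data sets, and evidence on the utility of privacy-preserving synthetic data is also lacking. This study aims to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques.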