Optimizing the synthesis of clinical trial data using sequential trees.

Authors

Khaled El Emam, Lucy Mosquera, Chaoyi Zheng

Affiliations

School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada.

Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada.

Publication information

J Am Med Inform Assoc. 2021 Jan 15;28(1):3-13. doi: 10.1093/jamia/ocaa249.

DOI: 10.1093/jamia/ocaa249

PMID: 33186440

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7810457/

Abstract

OBJECTIVE

With the growing demand for sharing clinical trial data, scalable methods that enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data depends on the variable order. No assessment of the impact of variable order on synthesized clinical trial data has been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as the variable order is randomly shuffled, and to implement an optimization algorithm that finds a good order if the variability is too high.
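To make the role of variable order concrete, here is a minimal sketch of sequential synthesis. It is a deliberate simplification and not the authors' method: instead of fitting a decision tree per variable on all earlier variables, each variable is sampled from its empirical distribution conditional on the single variable that precedes it in the order. The function name `synthesize_sequential` and the toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def synthesize_sequential(rows, order, n_out, seed=0):
    """Generate n_out synthetic rows, one variable at a time, following `order`.

    Assumption for illustration: conditional frequency tables on the previous
    variable stand in for the fitted trees used by sequential-tree synthesis.
    """
    rng = random.Random(seed)
    first = order[0]
    marginal = Counter(row[first] for row in rows)

    # conditional[(prev, var)][prev_value] -> Counter of observed var values
    conditional = {}
    for prev, var in zip(order, order[1:]):
        table = defaultdict(Counter)
        for row in rows:
            table[row[prev]][row[var]] += 1
        conditional[(prev, var)] = table

    def draw(counter):
        values, weights = zip(*counter.items())
        return rng.choices(values, weights=weights, k=1)[0]

    synthetic = []
    for _ in range(n_out):
        new_row = {first: draw(marginal)}  # first variable: marginal only
        for prev, var in zip(order, order[1:]):
            new_row[var] = draw(conditional[(prev, var)][new_row[prev]])
        synthetic.append(new_row)
    return synthetic


# Toy records (made up for this sketch, not trial data).
real = [
    {"arm": "treatment", "response": "yes", "grade": "low"},
    {"arm": "treatment", "response": "yes", "grade": "high"},
    {"arm": "treatment", "response": "no", "grade": "high"},
    {"arm": "control", "response": "no", "grade": "low"},
    {"arm": "control", "response": "no", "grade": "high"},
    {"arm": "control", "response": "yes", "grade": "low"},
]

order_a = ["arm", "response", "grade"]
# A random shuffle of the order, as in the simulation described above.
order_b = random.Random(1).sample(order_a, len(order_a))

syn_a = synthesize_sequential(real, order_a, n_out=5)
syn_b = synthesize_sequential(real, order_b, n_out=5)
```

Under `order_a`, `response` is generated from `arm`; under a different order it may be generated from `grade` or from its marginal alone, so the dependencies the synthetic data can reproduce change with the order.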

MATERIALS AND METHODS

Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm optimization was implemented to optimize variable order and was compared with a curriculum learning approach to ordering variables.
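The abstract names the utility metrics without giving formulas. The sketch below shows two common ways such metrics are computed, and both choices are assumptions here: the Hellinger distance between the real and synthetic distributions of one categorical variable as a univariate similarity, and the mean squared deviation of propensity scores from 0.5 as a distinguishability score. The function names are hypothetical, and the propensity scores are assumed to come from some classifier trained to tell real rows from synthetic rows in an equal-sized pooled dataset.

```python
import math
from collections import Counter


def hellinger(real_values, synth_values):
    """Hellinger distance between two empirical categorical distributions.

    0 means identical distributions, 1 means no shared categories.
    """
    p = Counter(real_values)
    q = Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    total = 0.0
    for category in set(p) | set(q):
        # Counter returns 0 for categories missing from one side.
        total += (math.sqrt(p[category] / n_p) - math.sqrt(q[category] / n_q)) ** 2
    return math.sqrt(total / 2.0)


def propensity_mse(propensity_scores):
    """Mean squared deviation of propensity scores from 0.5.

    Each score is the predicted probability that a pooled row is synthetic.
    0 means the classifier cannot tell the datasets apart; 0.25 means it
    separates them perfectly.
    """
    return sum((s - 0.5) ** 2 for s in propensity_scores) / len(propensity_scores)


# Toy inputs (made up for this sketch).
same = hellinger(["a", "b", "b"], ["b", "a", "b"])
disjoint = hellinger(["a", "a"], ["b", "b"])
indistinguishable = propensity_mse([0.5, 0.5, 0.5, 0.5])
separable = propensity_mse([0.0, 0.0, 1.0, 1.0])
```

Lower values are better for both scores in this sketch. The third metric, similarity in multivariate prediction accuracy, requires fitting predictive models on each dataset and is not shown.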

RESULTS

As the number of variables in a clinical trial dataset increases, the variability of data utility across variable orders increases markedly. Particle swarm optimization with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This approach was superior to curriculum learning in terms of utility.
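The combination of particle swarm and a hinge loss can be sketched as follows. Several details are assumptions made for illustration rather than details taken from the abstract: permutations are encoded with random keys (each particle holds one real number per variable, and sorting the numbers gives the order), the search stops as soon as the hinge loss reaches zero, and the threshold value and the toy scoring function are invented. In a real pipeline, `score_order` would synthesize data with the given order and return its distinguishability.

```python
import random


def hinge_loss(distinguishability, threshold):
    """Zero at or below the threshold, linear above it."""
    return max(0.0, distinguishability - threshold)


def keys_to_order(keys, variables):
    """Random-key decoding: sort variables by their continuous key."""
    return [v for _, v in sorted(zip(keys, variables))]


def particle_swarm_order(variables, score_order, threshold, n_particles=10,
                         n_iters=30, inertia=0.7, c_personal=1.5,
                         c_global=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(variables)
    positions = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    def loss(keys):
        return hinge_loss(score_order(keys_to_order(keys, variables)), threshold)

    personal_best = [p[:] for p in positions]
    personal_loss = [loss(p) for p in positions]
    best = min(range(n_particles), key=personal_loss.__getitem__)
    global_best, global_loss = personal_best[best][:], personal_loss[best]

    for _ in range(n_iters):
        if global_loss == 0.0:
            break  # threshold met; no reward for fitting the real data more closely
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            current = loss(positions[i])
            if current < personal_loss[i]:
                personal_best[i], personal_loss[i] = positions[i][:], current
                if current < global_loss:
                    global_best, global_loss = positions[i][:], current
    return keys_to_order(global_best, variables), global_loss


# Toy stand-in for "synthesize with this order, then measure distinguishability".
variables = ["age", "arm", "stage", "response", "survival"]
hypothetical_good_order = ["arm", "age", "stage", "response", "survival"]


def toy_distinguishability(order):
    mismatches = sum(a != b for a, b in zip(order, hypothetical_good_order))
    return 0.25 * mismatches / len(order)


THRESHOLD = 0.11  # hypothetical value, chosen only for this toy example
best_order, best_loss = particle_swarm_order(variables, toy_distinguishability, THRESHOLD)
```

Because the hinge is flat below the threshold, every order that meets the threshold is treated as equally good, so the optimizer is not pushed toward orders that reproduce the real data ever more closely. This matches the abstract's stated reason for choosing a threshold, namely avoiding overfitting that could create a privacy problem.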

CONCLUSIONS

The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cee9/7810457/3d7692b439f2/ocaa249f1.jpg
