Optimizing the synthesis of clinical trial data using sequential trees.

Author information

El Emam Khaled, Mosquera Lucy, Zheng Chaoyi

Affiliations

School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada.

Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada.

Publication information

J Am Med Inform Assoc. 2021 Jan 15;28(1):3-13. doi: 10.1093/jamia/ocaa249.

Abstract

OBJECTIVE

With the growing demand for sharing clinical trial data, scalable methods are needed to enable privacy-protective access to high-utility data. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data depends on the variable order. No assessment of the impact of variable order on synthesized clinical trial data has been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as variable order is randomly shuffled, and to implement an optimization algorithm to find a good order if variability is too high.

MATERIALS AND METHODS

Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm optimization was implemented to optimize variable order and was compared with a curriculum learning approach to ordering variables.
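To make concrete why variable order can matter in sequential synthesis, the following is a toy sketch and not the authors' implementation. It assumes categorical variables and replaces each fitted tree with a lookup table conditioned only on the immediately preceding variable; the function names `fit_sequential` and `sample_sequential` and the example records are hypothetical.

```python
import random
from collections import defaultdict

def fit_sequential(rows, order):
    """Fit one conditional model per variable, following `order`.

    Assumption of this sketch: each variable is conditioned only on the
    variable immediately before it, and a lookup table of observed values
    stands in for a fitted decision tree.
    """
    models = []
    for k, var in enumerate(order):
        table = defaultdict(list)
        for row in rows:
            # The first variable has no predecessor, so it uses the key None.
            key = row[order[k - 1]] if k > 0 else None
            table[key].append(row[var])
        models.append(table)
    return models

def sample_sequential(models, order, n, seed=0):
    """Generate n synthetic records, one variable at a time, in `order`."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        record = {}
        for k, var in enumerate(order):
            key = record[order[k - 1]] if k > 0 else None
            # Draw from the values observed alongside the predecessor's value.
            record[var] = rng.choice(models[k][key])
        synthetic.append(record)
    return synthetic

# Hypothetical toy records, for illustration only.
real = [
    {"arm": "treatment", "stage": "II", "response": "yes"},
    {"arm": "treatment", "stage": "III", "response": "no"},
    {"arm": "control", "stage": "II", "response": "no"},
    {"arm": "control", "stage": "III", "response": "no"},
]

order = ["arm", "stage", "response"]
models = fit_sequential(real, order)
synthetic = sample_sequential(models, order, n=10, seed=1)
```

In this simplified setup, the order `["arm", "stage", "response"]` models `response` from `stage` alone, whereas `["stage", "arm", "response"]` models it from `arm`, so different orders preserve different relationships in the synthetic data.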

RESULTS

As the number of variables in a clinical trial dataset increases, the variability of data utility across variable orders increases markedly. Particle swarm optimization with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This approach was superior to curriculum learning in terms of utility.
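The role of the hinge threshold can be illustrated with a minimal sketch. The functional form below (zero loss at or below the threshold, linear above it), the function name `hinge_loss`, and the numeric values are assumptions for illustration; the abstract does not state the exact loss or the threshold used.

```python
def hinge_loss(distinguishability, threshold):
    """Hinge loss on a distinguishability score (lower score = more similar data).

    The loss is zero once the score is at or below the threshold, so an
    optimizer gains nothing by pushing synthetic data even closer to the
    real data than the threshold requires.
    """
    return max(0.0, distinguishability - threshold)

# Hypothetical scores for three candidate variable orders.
candidate_scores = {"order_a": 0.25, "order_b": 0.04, "order_c": 0.01}
threshold = 0.05  # illustrative placeholder, not a value taken from the abstract

losses = {name: hinge_loss(score, threshold) for name, score in candidate_scores.items()}
```

Under these assumptions, `order_b` and `order_c` both receive a loss of zero, so the search has no incentive to prefer the order that matches the real data most closely, which is how a threshold of this kind can limit overfitting.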

CONCLUSIONS

The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cee9/7810457/3d7692b439f2/ocaa249f1.jpg
