Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study.

Authors

El Emam Khaled, Mosquera Lucy, Fang Xi, El-Hussuna Alaa

Affiliations

School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.

Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.

Publication

JMIR Med Inform. 2022 Apr 7;10(4):e35734. doi: 10.2196/35734.

Abstract

BACKGROUND

A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated, either in general or for comparing SDG methods.

OBJECTIVE

This study evaluates the ability of common utility metrics to rank SDG methods according to their performance on a specific analytic workload. The workload of interest is the use of synthetic data to build logistic regression prediction models, which is very common in health research.

METHODS

We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). Each metric was computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods by prediction performance. Prediction performance was defined as the difference in the area under the receiver operating characteristic curve (AUROC), and likewise in the area under the precision-recall curve (AUPRC), between logistic regression prediction models built on synthetic data and models built on real data.
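As a rough illustration of the prediction performance measure described above, the following Python sketch computes the gap in area under the receiver operating characteristic curve and in area under the precision-recall curve between two sets of predicted probabilities for the same held-out labels. It is not the authors' code. The function names, the sign convention (real minus synthetic), the use of average precision as the precision-recall estimate, and the toy numbers are all assumptions made for this example.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, a step-wise estimate of the area under the PR curve.

    Simplification: tied scores are ranked in input order rather than grouped.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision at each positive
    return total / hits


def performance_gap(labels, real_model_scores, synth_model_scores):
    """Gap between a real-data model and a synthetic-data model on the same labels.

    Assumed convention: positive values mean the real-data model did better.
    """
    return {
        "auroc_diff": auroc(labels, real_model_scores)
        - auroc(labels, synth_model_scores),
        "auprc_diff": average_precision(labels, real_model_scores)
        - average_precision(labels, synth_model_scores),
    }


# Toy example: hypothetical predicted probabilities on four held-out records.
held_out_labels = [1, 1, 0, 0]
real_scores = [0.9, 0.8, 0.2, 0.1]   # model fitted on real data
synth_scores = [0.9, 0.3, 0.6, 0.1]  # model fitted on synthetic data
print(performance_gap(held_out_labels, real_scores, synth_scores))
```

In a setting like the one in the abstract, such a gap would presumably be computed per synthetic data set and then averaged over the 20 synthetic data sets produced by each generative model.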

RESULTS

The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
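To give a feel for what a Hellinger distance on a Gaussian copula representation could look like, here is a minimal Python sketch. It makes several assumptions that are not stated in the abstract: all columns are treated as continuous, each column is mapped to normal scores through its ranks, the copula is summarized by the Pearson correlation matrix of those scores, and the distance uses the closed form for two zero-mean Gaussians, H = sqrt(1 - det(R1)^(1/4) det(R2)^(1/4) / det((R1 + R2)/2)^(1/2)). The function names and toy data are hypothetical, and the authors' implementation may differ.

```python
import math
from statistics import NormalDist


def normal_scores(column):
    """Map a column to normal scores via ranks (ties broken by input order)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length, non-constant columns."""
    n = len(columns[0])
    centered = [[v - sum(col) / n for v in col] for col in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    d = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(d)]
        for i in range(d)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    d, det = len(m), 1.0
    for k in range(d):
        pivot = max(range(k, d), key=lambda r: abs(m[r][k]))
        if abs(m[pivot][k]) < 1e-12:
            return 0.0
        if pivot != k:
            m[k], m[pivot] = m[pivot], m[k]
            det = -det
        det *= m[k][k]
        for r in range(k + 1, d):
            factor = m[r][k] / m[k][k]
            for c in range(k, d):
                m[r][c] -= factor * m[k][c]
    return det


def copula_hellinger(real_columns, synth_columns):
    """Hellinger distance between Gaussian copulas fitted to two data sets.

    Assumes both correlation matrices are positive definite (no duplicated
    or perfectly correlated columns).
    """
    r1 = correlation_matrix([normal_scores(c) for c in real_columns])
    r2 = correlation_matrix([normal_scores(c) for c in synth_columns])
    d = len(r1)
    avg = [[(r1[i][j] + r2[i][j]) / 2 for j in range(d)] for i in range(d)]
    # Bhattacharyya coefficient of two zero-mean Gaussians
    bc = determinant(r1) ** 0.25 * determinant(r2) ** 0.25 / math.sqrt(determinant(avg))
    return math.sqrt(max(0.0, 1.0 - bc))


# Toy data: two variables; the synthetic copy reverses their association.
real = [[1, 2, 3, 4, 5, 6], [1, 3, 2, 5, 4, 6]]
synth = [[1, 2, 3, 4, 5, 6], [6, 4, 5, 2, 3, 1]]
print(copula_hellinger(real, synth))
```

The distance lies between 0 and 1, with 0 meaning the two fitted copulas are identical, so a lower value would indicate synthetic data whose dependence structure is closer to that of the real data.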

CONCLUSIONS

This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternative SDG methods.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3159/9030990/3a774c248839/medinform_v10i4e35734_fig1.jpg
