

Generating high-fidelity synthetic patient data for assessing machine learning healthcare software.

Author information

Tucker Allan, Wang Zhenchen, Rotalinti Ylenia, Myles Puja

Affiliations

Department of Computer Science, Brunel University London, London, UK.

CPRD, Medicines & Healthcare Products Regulatory Agency, London, UK.

Publication information

NPJ Digit Med. 2020 Nov 9;3(1):147. doi: 10.1038/s41746-020-00353-9.

Abstract

There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, many patient privacy issues must be addressed before these data can be better harnessed by all sectors. One approach that could circumvent privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set as possible (distributions, non-linear relationships, and noise) but that do not include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. Moreover, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.
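The abstract's final claim, that synthetic records are rarely identical or very similar to real patients, can be illustrated with a simple nearest-neighbour check. The sketch below is not the authors' procedure: the function names `nearest_real_distance` and `fraction_too_close`, the use of standardised Euclidean distance, and the `threshold` parameter are all assumptions made for illustration, and it presumes purely numeric features with no missing values.

```python
import numpy as np

def nearest_real_distance(synthetic, real):
    """Distance from each synthetic row to its closest real row.

    Hypothetical helper: features are standardised with the real data's
    mean and standard deviation so that no single feature dominates.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    mean = real.mean(axis=0)
    std = real.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    s = (synthetic - mean) / std
    r = (real - mean) / std
    # Pairwise differences, shape (n_synthetic, n_real, n_features)
    diffs = s[:, None, :] - r[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

def fraction_too_close(synthetic, real, threshold):
    """Share of synthetic rows within `threshold` of some real row.

    `threshold` is an assumed, user-chosen cut-off; 0.0 counts only
    exact copies of real records.
    """
    return float(np.mean(nearest_real_distance(synthetic, real) <= threshold))
```

A small fraction at a small threshold would be consistent with a low risk of reproducing real patients, although a distance check of this kind is only one possible proxy for re-identification risk.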


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/098b/7653933/852dc577688e/41746_2020_353_Fig1_HTML.jpg
