

Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset.

Affiliation

School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom.

Publication information

PLoS One. 2023 Mar 16;18(3):e0283094. doi: 10.1371/journal.pone.0283094. eCollection 2023.

Abstract

INTRODUCTION

The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months, because it could widen access to data and overcome privacy concerns when data are shared. The field of generative artificial intelligence and synthetic data is still early in its development, and there is a gap in the research evidencing that synthetic data can adequately train algorithms intended for use on real data. This study compares the performance of a series of machine learning models trained on real data and synthetic data, based on the National Diet and Nutrition Survey (NDNS).

METHODS

Features identified as potentially relevant using directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination indicated that only four variables were needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed from these four variables to predict blood pressure. Models were trained on four training sets: the real data (n = 2408), a synthetic dataset of the same size (n = 2408), a larger synthetic dataset (n = 4816), and a combination of the real and synthetic training sets (n = 4816). The same test set (n = 424) was used for each model.
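The training design described above (several training sets, one shared test set) can be sketched as follows. This is a minimal illustration only: the blood pressure values are made up rather than taken from the NDNS, the names `fit_mean_baseline` and `arms` are hypothetical, and the placeholder model simply predicts the training-set mean instead of using the regression, random forest or neural network models from the study.

```python
from statistics import mean

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def fit_mean_baseline(train_targets):
    """Placeholder model (an assumption, not one of the study's models):
    always predicts the mean of the training targets."""
    centre = mean(train_targets)
    return lambda n: [centre] * n

# Hypothetical mean arterial pressure values, not NDNS data.
real_train = [88.0, 92.0, 95.0, 101.0]
synthetic_train = [87.0, 93.0, 96.0, 100.0]
extra_synthetic = [90.0, 94.0, 91.0, 99.0]
test_set = [89.0, 97.0, 93.0]  # the same test set is reused for every arm

# Four training arms mirroring the design: real, synthetic of equal size,
# synthetic of double size, and real plus synthetic combined.
arms = {
    "real": real_train,
    "synthetic": synthetic_train,
    "synthetic_x2": synthetic_train + extra_synthetic,
    "real_plus_synthetic": real_train + synthetic_train,
}

results = {}
for name, targets in arms.items():
    predict = fit_mean_baseline(targets)
    results[name] = mean_absolute_error(test_set, predict(len(test_set)))

for name, error in results.items():
    print(f"{name}: MAE = {error:.2f}")
```

Holding the test set fixed means any difference in error between arms can be attributed to the training data rather than to the evaluation cases.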

RESULTS

Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 to 8.33. This indicates that models trained on synthetic data were as accurate as models trained on real data.
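The abstract does not say which statistical test supported the finding of no significant difference. As one possible approach, and purely as an assumption, a paired bootstrap on the shared test set gives an interval for the difference in mean absolute error between two models. The function name `bootstrap_mae_difference` and its parameters are hypothetical.

```python
import random
from statistics import mean

def bootstrap_mae_difference(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """Return (point estimate, lower, upper) for MAE(a) - MAE(b).

    Uses a percentile bootstrap with an approximate 95% interval.
    This is an illustrative choice, not the test used in the study.
    """
    rng = random.Random(seed)
    # Per-case difference in absolute error. Pairing is valid because
    # both models are scored on exactly the same test cases.
    diffs = [abs(t - a) - abs(t - b) for t, a, b in zip(y_true, pred_a, pred_b)]
    n = len(diffs)
    # Resample test cases with replacement and recompute the mean difference.
    stats = sorted(
        mean(rng.choice(diffs) for _ in range(n)) for _ in range(n_resamples)
    )
    lower = stats[n_resamples * 25 // 1000]
    upper = stats[n_resamples * 975 // 1000 - 1]
    return mean(diffs), lower, upper

# Hypothetical observed values and predictions from two models.
observed = [89.0, 97.0, 93.0, 104.0, 85.0, 91.0]
model_real = [92.0, 95.0, 94.0, 98.0, 90.0, 93.0]
model_synth = [91.0, 96.0, 95.0, 99.0, 89.0, 94.0]

estimate, low, high = bootstrap_mae_difference(observed, model_real, model_synth)
print(f"MAE difference = {estimate:.2f}, interval = ({low:.2f}, {high:.2f})")
```

An interval that contains zero would be consistent with no detectable difference between the two models on this test set, although it does not by itself demonstrate equivalence.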

DISCUSSION

Further research on a variety of datasets is needed to confirm that synthetic data can replace potentially identifiable patient data. Research is also urgently needed to establish whether synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset.

