Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset.

Affiliation

School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom.

Publication information

PLoS One. 2023 Mar 16;18(3):e0283094. doi: 10.1371/journal.pone.0283094. eCollection 2023.

DOI: 10.1371/journal.pone.0283094
PMID: 36928534
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10019654/
Abstract

INTRODUCTION

The potential for synthetic data to replace real data in research has attracted attention in recent months, because it could increase access to data and overcome privacy concerns when data are shared. The field of generative artificial intelligence and synthetic data is still early in its development, and there is a gap in research evidencing that synthetic data can adequately be used to train algorithms that are then applied to real data. This study compares the performance of a series of machine learning models trained on real data and on synthetic data, based on the National Diet and Nutrition Survey (NDNS).

METHODS

Features identified as potentially relevant by directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination identified only four variables needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed from these four variables to predict blood pressure. Models were trained on the real data training set (n = 2408), a synthetic data training set (n = 2408), a larger synthetic data training set (n = 4816), and a combination of the real and synthetic data training sets (n = 4816). The same test set (n = 424) was used for each model.
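
The training design described above (four training sets, one shared test set) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data are simulated rather than drawn from the NDNS, the synthetic rows come from a simple multivariate Gaussian fitted to the training data (the abstract does not name the generator used), and an ordinary least-squares fit stands in for the Bayesian, random forest and neural network models. All function names and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_real(n):
    """Simulated stand-in for survey rows (NOT NDNS data).

    Returns features [age, sex, weight, height] and a made-up
    mean arterial pressure target.
    """
    age = rng.uniform(18, 90, n)
    sex = rng.integers(0, 2, n).astype(float)
    height = rng.normal(165 + 12 * sex, 7, n)
    weight = rng.normal(0.45 * height, 12, n)
    # Hypothetical relationship, chosen only so the example has signal.
    target = 70 + 0.25 * age + 3 * sex + 0.15 * weight + rng.normal(0, 10, n)
    return np.column_stack([age, sex, weight, height]), target


def synthesise(X, y, n):
    """Assumed generator: sample rows from a Gaussian fitted to (X, y).

    Sex becomes continuous here, which a real generator would avoid.
    """
    data = np.column_stack([X, y])
    sample = rng.multivariate_normal(
        data.mean(axis=0), np.cov(data, rowvar=False), size=n
    )
    return sample[:, :-1], sample[:, -1]


def fit_ols(X, y):
    """Least-squares linear model with an intercept term."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


# Sample sizes follow the abstract; everything else is illustrative.
X_train, y_train = simulate_real(2408)
X_test, y_test = simulate_real(424)
X_syn, y_syn = synthesise(X_train, y_train, 2408)
X_syn_big, y_syn_big = synthesise(X_train, y_train, 4816)

training_sets = {
    "real": (X_train, y_train),
    "synthetic": (X_syn, y_syn),
    "synthetic_large": (X_syn_big, y_syn_big),
    "combined": (
        np.vstack([X_train, X_syn]),
        np.concatenate([y_train, y_syn]),
    ),
}

# Every model is scored on the same held-out test set.
results = {
    name: mean_absolute_error(y_test, predict(fit_ols(X, y), X_test))
    for name, (X, y) in training_sets.items()
}
for name, score in results.items():
    print(f"{name}: error = {score:.2f}")
```

Holding the test set fixed is the important design choice: any difference between the four scores then reflects the training data alone, not a change in the evaluation rows.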

RESULTS

Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 to 8.33. This indicates that synthetic data was capable of training machine learning models as accurate as those trained on real data.
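
The abstract does not define "mean average error" or state its units. Assuming it refers to the usual mean absolute error (the average size of the difference between observed and predicted values, presumably in mmHg), a small worked example looks like this. The blood pressure values are invented for illustration and are not taken from the study.

```python
def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over paired values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total / len(observed)


# Invented values (not study data): observed vs. predicted pressure.
observed = [92.0, 101.0, 85.0, 110.0]
predicted = [96.0, 95.0, 90.0, 100.0]

# Absolute errors are 4, 6, 5 and 10, so the mean is 25 / 4.
print(mean_absolute_error(observed, predicted))  # 6.25
```

Under that assumption, the reported range of 8.12 to 8.33 would mean predictions were off by a little over 8 units on average, whichever training set was used.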

DISCUSSION

Further research on a variety of datasets is needed to confirm that synthetic data can replace potentially identifiable patient data. Research is also urgently needed to provide evidence that synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset.
