
Identifying and handling data bias within primary healthcare data using synthetic data generators.

Authors

Draghi Barbara, Wang Zhenchen, Myles Puja, Tucker Allan

Affiliations

Medicines and Healthcare products Regulatory Agency, London, UK.

Brunel University London, London, UK.

Publication

Heliyon. 2024 Jan 10;10(2):e24164. doi: 10.1016/j.heliyon.2024.e24164. eCollection 2024 Jan 30.

DOI:10.1016/j.heliyon.2024.e24164
PMID:38288010
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10823075/
Abstract

Advanced synthetic data generators can simulate data samples that closely resemble sensitive personal datasets while significantly reducing the risk of individual identification. The use of these advanced generators holds enormous potential in the medical field, as it allows for the simulation and sharing of sensitive patient data. This enables the development and rigorous validation of novel AI technologies for accurate diagnosis and efficient disease management. Despite the availability of massive ground truth datasets (such as UK-NHS databases that contain millions of patient records), the risk of biases being carried over to data generators still exists. These biases may arise from the under-representation of specific patient cohorts due to cultural sensitivities within certain communities or standardised data collection procedures. Machine learning models can exhibit bias in various forms, including the under-representation of certain groups in the data. This can lead to missing data and inaccurate correlations and distributions, which may also be reflected in synthetic data. Our paper aims to improve synthetic data generators by introducing probabilistic approaches to first detect difficult-to-predict data samples in ground truth data and then boost them when applying the generator. In addition, we explore strategies to generate synthetic data that can reduce bias and, at the same time, improve the performance of predictive models.

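The abstract describes a two-step idea: detect difficult-to-predict samples in the ground truth data with a probabilistic model, then boost those samples when fitting the synthetic data generator. The sketch below illustrates that pattern, but it is not the paper's implementation: the classifier, the 0.6 confidence threshold, the replication factor, and the use of a Gaussian mixture as a stand-in generator are all assumptions for illustration.

```python
# Illustrative sketch: (1) flag "difficult-to-predict" samples via
# cross-validated class probabilities, (2) oversample (boost) them
# before fitting a synthetic-data generator.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Toy "ground truth": a majority cohort and an under-represented cohort.
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 900 + [1] * 100)

# Step 1: out-of-fold probability each sample receives for its true class.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
p_true = proba[np.arange(len(y)), y]

# Low-confidence samples are treated as difficult to predict.
difficult = p_true < 0.6   # threshold is an assumption

# Step 2: boost difficult samples by replicating them before fitting
# the generator (a Gaussian mixture stands in for an advanced generator).
boost = 3                  # replication factor is an assumption
X_boosted = np.vstack([X, np.repeat(X[difficult], boost - 1, axis=0)])

gen = GaussianMixture(n_components=2, random_state=0).fit(X_boosted)
X_synth, _ = gen.sample(1000)   # synthetic rows shaped like the input
```

Because difficult samples are over-weighted in the generator's training set, the synthetic output should cover the hard-to-model regions more densely than a generator fitted on the raw data would.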

Figures (gr001–gr014, PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/378da5e97309/gr001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/358c71925775/gr002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/c437feaf2100/gr003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/201f0512a84f/gr004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/6e31026ef3ce/gr005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/f02d4916072a/gr006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/2103cab3a4a0/gr007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/42665486a252/gr008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/99e7123f7284/gr009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/7d8c4e0f9a55/gr010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/f8b21cdd024e/gr011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/ef6e13b83a6c/gr012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/c5103d2385aa/gr013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/9a8df8ec255f/gr014.jpg

Similar articles

1
Identifying and handling data bias within primary healthcare data using synthetic data generators.
Heliyon. 2024 Jan 10;10(2):e24164. doi: 10.1016/j.heliyon.2024.e24164. eCollection 2024 Jan 30.
2
Generating high-fidelity synthetic patient data for assessing machine learning healthcare software.
NPJ Digit Med. 2020 Nov 9;3(1):147. doi: 10.1038/s41746-020-00353-9.
3
Synthetic data generation methods in healthcare: A review on open-source tools and methods.
Comput Struct Biotechnol J. 2024 Jul 9;23:2892-2910. doi: 10.1016/j.csbj.2024.07.005. eCollection 2024 Dec.
4
Inherent Bias in Electronic Health Records: A Scoping Review of Sources of Bias.
medRxiv. 2024 Apr 12:2024.04.09.24305594. doi: 10.1101/2024.04.09.24305594.
5
The future of Cochrane Neonatal.
Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.
6
Sound therapy (using amplification devices and/or sound generators) for tinnitus.
Cochrane Database Syst Rev. 2018 Dec 27;12(12):CD013094. doi: 10.1002/14651858.CD013094.pub2.
7
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
8
MarkVCID cerebral small vessel consortium: I. Enrollment, clinical, fluid protocols.
Alzheimers Dement. 2021 Apr;17(4):704-715. doi: 10.1002/alz.12215. Epub 2021 Jan 21.
9
Implicit Bias
10
Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing.
JMIR Med Inform. 2020 Jul 20;8(7):e18910. doi: 10.2196/18910.

Cited by

1
Synthetic data distillation enables the extraction of clinical information at scale.
NPJ Digit Med. 2025 May 10;8(1):267. doi: 10.1038/s41746-025-01681-4.
2
Enhancing generalization in a Kawasaki Disease prediction model using data augmentation: Cross-validation of patients from two major hospitals in Taiwan.
PLoS One. 2024 Dec 31;19(12):e0314995. doi: 10.1371/journal.pone.0314995. eCollection 2024.
3
Decades in the Making: The Evolution of Digital Health Research Infrastructure Through Synthetic Data, Common Data Models, and Federated Learning.
J Med Internet Res. 2024 Dec 20;26:e58637. doi: 10.2196/58637.

References

1
Generating high-fidelity synthetic patient data for assessing machine learning healthcare software.
NPJ Digit Med. 2020 Nov 9;3(1):147. doi: 10.1038/s41746-020-00353-9.
2
Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare.
NPJ Digit Med. 2020 Jun 1;3:81. doi: 10.1038/s41746-020-0288-5. eCollection 2020.
3
Overview of artificial intelligence in medicine.
J Family Med Prim Care. 2019 Jul;8(7):2328-2331. doi: 10.4103/jfmpc.jfmpc_440_19.
4
Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum.
Int J Epidemiol. 2019 Dec 1;48(6):1740-1740g. doi: 10.1093/ije/dyz034.
5
Gender bias in medicine.
Womens Health (Lond). 2008 May;4(3):237-43. doi: 10.2217/17455057.4.3.237.
6
The problem of bias in training data in regression problems in medical decision support.
Artif Intell Med. 2002 Jan;24(1):51-70. doi: 10.1016/s0933-3657(01)00092-6.
7
Man-made medicine and women's health: the biopolitics of sex/gender and race/ethnicity.
Int J Health Serv. 1994;24(2):265-83. doi: 10.2190/LWLH-NMCJ-UACL-U80Y.