
Identifying and handling data bias within primary healthcare data using synthetic data generators.

Authors

Draghi Barbara, Wang Zhenchen, Myles Puja, Tucker Allan

Affiliations

Medicines and Healthcare products Regulatory Agency, London, UK.

Brunel University London, London, UK.

Publication

Heliyon. 2024 Jan 10;10(2):e24164. doi: 10.1016/j.heliyon.2024.e24164. eCollection 2024 Jan 30.

DOI:10.1016/j.heliyon.2024.e24164
PMID:38288010
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10823075/
Abstract

Advanced synthetic data generators can simulate data samples that closely resemble sensitive personal datasets while significantly reducing the risk of individual identification. The use of these advanced generators holds enormous potential in the medical field, as it allows for the simulation and sharing of sensitive patient data. This enables the development and rigorous validation of novel AI technologies for accurate diagnosis and efficient disease management. Despite the availability of massive ground truth datasets (such as UK-NHS databases that contain millions of patient records), the risk of biases being carried over to data generators still exists. These biases may arise from the under-representation of specific patient cohorts due to cultural sensitivities within certain communities or standardised data collection procedures. Machine learning models can exhibit bias in various forms, including the under-representation of certain groups in the data. This can lead to missing data and inaccurate correlations and distributions, which may also be reflected in synthetic data. Our paper aims to improve synthetic data generators by introducing probabilistic approaches to first detect difficult-to-predict data samples in ground truth data and then boost them when applying the generator. In addition, we explore strategies to generate synthetic data that can reduce bias and, at the same time, improve the performance of predictive models.

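The abstract describes a two-step idea: detect difficult-to-predict samples in the ground truth data with a probabilistic model, then boost those samples when fitting the synthetic data generator. The sketch below illustrates that pattern, but it is not the paper's implementation: the classifier, the 0.6 confidence threshold, the replication factor, and the use of a Gaussian mixture as a stand-in generator are all assumptions for illustration.

```python
# Illustrative sketch: (1) flag "difficult-to-predict" samples via
# cross-validated class probabilities, (2) oversample (boost) them
# before fitting a synthetic-data generator.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Toy "ground truth": a majority cohort and an under-represented cohort.
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 900 + [1] * 100)

# Step 1: out-of-fold probability each sample receives for its true class.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
p_true = proba[np.arange(len(y)), y]

# Low-confidence samples are treated as difficult to predict.
difficult = p_true < 0.6   # threshold is an assumption

# Step 2: boost difficult samples by replicating them before fitting
# the generator (a Gaussian mixture stands in for an advanced generator).
boost = 3                  # replication factor is an assumption
X_boosted = np.vstack([X, np.repeat(X[difficult], boost - 1, axis=0)])

gen = GaussianMixture(n_components=2, random_state=0).fit(X_boosted)
X_synth, _ = gen.sample(1000)   # synthetic rows shaped like the input
```

Because difficult samples are over-weighted in the generator's training set, the synthetic output should cover the hard-to-model regions more densely than a generator fitted on the raw data would.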

Figures (gr001–gr014, PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/378da5e97309/gr001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/358c71925775/gr002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/c437feaf2100/gr003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/201f0512a84f/gr004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/6e31026ef3ce/gr005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/f02d4916072a/gr006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/2103cab3a4a0/gr007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/42665486a252/gr008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/99e7123f7284/gr009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/7d8c4e0f9a55/gr010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/f8b21cdd024e/gr011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/ef6e13b83a6c/gr012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/c5103d2385aa/gr013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2465/10823075/9a8df8ec255f/gr014.jpg

Similar articles

1
Identifying and handling data bias within primary healthcare data using synthetic data generators.
Heliyon. 2024 Jan 10;10(2):e24164. doi: 10.1016/j.heliyon.2024.e24164. eCollection 2024 Jan 30.
2
Generating high-fidelity synthetic patient data for assessing machine learning healthcare software.
NPJ Digit Med. 2020 Nov 9;3(1):147. doi: 10.1038/s41746-020-00353-9.
3
Synthetic data generation methods in healthcare: A review on open-source tools and methods.
Comput Struct Biotechnol J. 2024 Jul 9;23:2892-2910. doi: 10.1016/j.csbj.2024.07.005. eCollection 2024 Dec.
4
Inherent Bias in Electronic Health Records: A Scoping Review of Sources of Bias.
medRxiv. 2024 Apr 12:2024.04.09.24305594. doi: 10.1101/2024.04.09.24305594.
5
The future of Cochrane Neonatal.
Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.
6
Sound therapy (using amplification devices and/or sound generators) for tinnitus.
Cochrane Database Syst Rev. 2018 Dec 27;12(12):CD013094. doi: 10.1002/14651858.CD013094.pub2.
7
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
8
MarkVCID cerebral small vessel consortium: I. Enrollment, clinical, fluid protocols.
Alzheimers Dement. 2021 Apr;17(4):704-715. doi: 10.1002/alz.12215. Epub 2021 Jan 21.
9
Implicit Bias
10
Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing.
JMIR Med Inform. 2020 Jul 20;8(7):e18910. doi: 10.2196/18910.

Cited by

1
Synthetic data distillation enables the extraction of clinical information at scale.
NPJ Digit Med. 2025 May 10;8(1):267. doi: 10.1038/s41746-025-01681-4.
2
Enhancing generalization in a Kawasaki Disease prediction model using data augmentation: Cross-validation of patients from two major hospitals in Taiwan.
PLoS One. 2024 Dec 31;19(12):e0314995. doi: 10.1371/journal.pone.0314995. eCollection 2024.
3
Decades in the Making: The Evolution of Digital Health Research Infrastructure Through Synthetic Data, Common Data Models, and Federated Learning.
J Med Internet Res. 2024 Dec 20;26:e58637. doi: 10.2196/58637.

References

1
Generating high-fidelity synthetic patient data for assessing machine learning healthcare software.
NPJ Digit Med. 2020 Nov 9;3(1):147. doi: 10.1038/s41746-020-00353-9.
2
Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare.
NPJ Digit Med. 2020 Jun 1;3:81. doi: 10.1038/s41746-020-0288-5. eCollection 2020.
3
Overview of artificial intelligence in medicine.
J Family Med Prim Care. 2019 Jul;8(7):2328-2331. doi: 10.4103/jfmpc.jfmpc_440_19.
4
Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum.
Int J Epidemiol. 2019 Dec 1;48(6):1740-1740g. doi: 10.1093/ije/dyz034.
5
Gender bias in medicine.
Womens Health (Lond). 2008 May;4(3):237-43. doi: 10.2217/17455057.4.3.237.
6
The problem of bias in training data in regression problems in medical decision support.
Artif Intell Med. 2002 Jan;24(1):51-70. doi: 10.1016/s0933-3657(01)00092-6.
7
Man-made medicine and women's health: the biopolitics of sex/gender and race/ethnicity.
Int J Health Serv. 1994;24(2):265-83. doi: 10.2190/LWLH-NMCJ-UACL-U80Y.