Suppr超能文献

生成式人工智能通过合成健康数据减轻表征偏差并提高模型公平性。

Generative AI mitigates representation bias and improves model fairness through synthetic health data.

作者信息

Marchesi Raffaele, Micheletti Nicolo, I-Hsien Kuo Nicholas, Barbieri Sebastiano, Jurman Giuseppe, Osmani Venet

机构信息

Data Science for Health (DSH), Fondazione Bruno Kessler, Trento, Italy.

Department of Mathematics, University of Pavia, Pavia, Italy.

出版信息

PLoS Comput Biol. 2025 May 19;21(5):e1013080. doi: 10.1371/journal.pcbi.1013080. eCollection 2025 May.

Abstract

Representation bias in health data can lead to unfair decisions and compromise the generalisability of research findings. As a consequence, underrepresented subpopulations, such as those from specific ethnic backgrounds or genders, do not benefit equally from clinical discoveries. Several approaches have been developed to mitigate representation bias, ranging from simple resampling methods, such as SMOTE, to recent approaches based on generative adversarial networks (GAN). However, generating high-dimensional time-series synthetic health data remains a significant challenge. In response, we devised a novel architecture (CA-GAN) that synthesises authentic, high-dimensional time series data. CA-GAN outperforms state-of-the-art methods in a qualitative and a quantitative evaluation while avoiding mode collapse, a serious GAN failure. We perform evaluation using 7535 patients with hypotension and sepsis from two diverse, real-world clinical datasets. We show that synthetic data generated by our CA-GAN improves model fairness in Black patients as well as female patients when evaluated separately for each subpopulation. Furthermore, CA-GAN generates authentic data of the minority class while faithfully maintaining the original distribution of data, resulting in improved performance in a downstream predictive task.

摘要

健康数据中的代表性偏差可能导致不公平的决策,并损害研究结果的普遍性。因此,代表性不足的亚群体,如来自特定种族背景或性别的群体,无法平等地从临床发现中受益。已经开发了几种方法来减轻代表性偏差,从简单的重采样方法,如SMOTE,到基于生成对抗网络(GAN)的最新方法。然而,生成高维时间序列合成健康数据仍然是一项重大挑战。作为回应,我们设计了一种新颖的架构(CA-GAN),用于合成真实的高维时间序列数据。在定性和定量评估中,CA-GAN优于现有方法,同时避免了模式崩溃,这是GAN的一个严重故障。我们使用来自两个不同的真实世界临床数据集的7535名低血压和脓毒症患者进行评估。我们表明,当对每个亚群体分别进行评估时,我们的CA-GAN生成的合成数据提高了黑人患者和女性患者的模型公平性。此外,CA-GAN生成少数群体类别的真实数据,同时忠实地保持数据的原始分布,从而在下游预测任务中提高性能。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e03a/12112403/813e1a5dcdeb/pcbi.1013080.g001.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验