
FanFAIR: sensitive data sets semi-automatic fairness assessment.

Authors

Gallese Chiara, Scantamburlo Teresa, Manzoni Luca, Giannerini Simone, Nobile Marco S

Affiliations

Department of Law, University of Turin, Lungo Dora Siena 100, 10153, Turin, Italy.

Tilburg Institute for Law, Technology, and Society (TILT), Tilburg University, Prof. Cobbenhagenlaan 221, Tilburg, 5037, The Netherlands.

Publication

BMC Med Inform Decis Mak. 2025 Sep 12;25(Suppl 3):329. doi: 10.1186/s12911-025-03184-4.

Abstract

BACKGROUND

Research has shown that data sets can convey social bias into Artificial Intelligence systems, especially those based on machine learning. A biased data set is not representative of reality and might contribute to perpetuating societal biases within the model. To tackle this problem, it is important to understand how to avoid biases, errors, and unethical practices when creating data sets. To provide guidance for the use of data sets in critical decision-making contexts, such as health decisions, we identified six fundamental data set features (balance, numerosity, unevenness, compliance, quality, incompleteness) that could affect model fairness. These features were the foundation of the FanFAIR framework.

RESULTS

We extended the FanFAIR framework for the semi-automated evaluation of fairness in data sets by combining statistical information on the data with qualitative features. In particular, we present an improved version of FanFAIR that introduces novel outlier detection capabilities operating in a multivariate fashion, using two state-of-the-art methods: Empirical Cumulative-distribution Outlier Detection (ECOD) and Isolation Forest. We also introduce a novel metric for data set balance based on an entropy measure.
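As an illustration of these two statistical components, the sketch below flags multivariate outliers with ECOD (from the pyod package) and Isolation Forest (from scikit-learn), and scores class balance with a normalized Shannon entropy. The entropy formulation and the synthetic data are assumptions made for this example; the abstract does not specify FanFAIR's exact metric or aggregation.

```python
import numpy as np
from pyod.models.ecod import ECOD             # empirical-CDF-based detector
from sklearn.ensemble import IsolationForest  # tree-based detector

def outlier_fractions(X: np.ndarray) -> dict:
    """Flag multivariate outliers with ECOD and Isolation Forest,
    returning the fraction of rows each method marks as anomalous."""
    ecod = ECOD().fit(X)
    ecod_frac = ecod.labels_.mean()              # pyod labels: 1 = outlier

    iso = IsolationForest(random_state=0).fit(X)
    iso_frac = (iso.predict(X) == -1).mean()     # sklearn: -1 = outlier
    return {"ecod": ecod_frac, "isolation_forest": iso_frac}

def entropy_balance(labels: np.ndarray) -> float:
    """Normalized Shannon entropy of the class distribution:
    1.0 for perfectly balanced classes, 0.0 for a single class.
    (A common balance measure; the paper's metric may differ.)"""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    if len(p) == 1:
        return 0.0
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

# Toy usage on synthetic data
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = rng.choice([0, 1], size=500, p=[0.9, 0.1])   # imbalanced labels
print(outlier_fractions(X))
print(f"balance = {entropy_balance(y):.3f}")
```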

CONCLUSION

We addressed the question of how much (un)fairness a data set used for machine learning research can contain, focusing on classification tasks. We developed a rule-based approach, grounded in fuzzy logic, that combines these characteristics into a single score and enables a semi-automatic evaluation of a data set in algorithmic fairness research. Our tool produces a detailed visual report on the fairness of the data set. We demonstrate the effectiveness of FanFAIR by applying the method to two open data sets.
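To illustrate how a rule-based fuzzy system can combine per-feature scores into a single fairness score, here is a minimal, self-contained zero-order Sugeno-style sketch in plain Python. The membership functions, rule set, and feature names are hypothetical, invented for this example; FanFAIR's actual rule base is defined in the paper, not in this abstract.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Hypothetical fuzzy terms over the normalized range [0, 1]
LOW  = lambda x: tri(x, -0.5, 0.0, 0.5)
HIGH = lambda x: tri(x, 0.5, 1.0, 1.5)

def fairness_score(features: dict) -> float:
    """Combine per-feature scores into one fairness score via
    zero-order Sugeno inference (weighted average of rule outputs)."""
    # Each rule: (firing strength, crisp output). AND = min.
    rules = [
        # IF balance IS high AND quality IS high THEN fairness = 1.0
        (min(HIGH(features["balance"]), HIGH(features["quality"])), 1.0),
        # IF incompleteness IS high THEN fairness = 0.2
        (HIGH(features["incompleteness"]), 0.2),
        # IF compliance IS low THEN fairness = 0.0
        (LOW(features["compliance"]), 0.0),
    ]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den > 0 else 0.5   # neutral score if no rule fires

print(fairness_score({"balance": 0.9, "quality": 0.8,
                      "incompleteness": 0.1, "compliance": 0.95}))
```

Each rule fires with a strength equal to the minimum of its antecedent memberships, and the crisp rule outputs are averaged, weighted by firing strength; this is the standard zero-order Sugeno defuzzification and only one of several ways such rules could be combined.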


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6161/12427094/1483e78862f3/12911_2025_3184_Fig1_HTML.jpg
