Galhotra Sainyam, Shanmugam Karthikeyan, Sattigeri Prasanna, Varshney Kush R
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA.
IBM Research, Yorktown Heights, NY 10598, USA.
Entropy (Basel). 2021 Nov 25;23(12):1571. doi: 10.3390/e23121571.
The deployment of machine learning (ML) systems in applications with societal impact has motivated the study of fairness for marginalized groups. Often, the protected attribute is absent from the training dataset for legal reasons. However, datasets still contain proxy attributes that capture protected information and can inject unfairness into the ML model. Some deployed systems allow auditors, decision makers, or affected users to report issues or seek recourse by flagging individual samples. In this work, we examine such systems and consider a feedback-based framework where the protected attribute is unavailable and the flagged samples serve as indirect knowledge of it. The reported samples are used as guidance to identify the proxy attributes that are causally dependent on the (unknown) protected attribute. We work under the causal interventional fairness paradigm. Without requiring the underlying structural causal model a priori, we propose an approach that performs conditional independence tests on observed data to identify such proxy attributes. We theoretically prove the optimality of our algorithm, bound its complexity, and complement it with an empirical evaluation demonstrating its efficacy on various real-world and synthetic datasets.
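To make the core idea concrete, here is a minimal, hypothetical Python sketch of a conditional-independence-based proxy screen. It is not the paper's algorithm: it simply illustrates how flagged samples can be used as a stand-in signal and how each candidate attribute can be tested for dependence on that signal, conditioned on the remaining attributes, via a partial-correlation test under a Gaussian assumption. All names (`screen_proxies`, the flag indicator, the significance threshold) are assumptions introduced for illustration.

```python
# Hypothetical sketch: rank candidate proxy attributes by testing
# conditional (in)dependence between each attribute and a binary
# flag indicator derived from user/auditor reports.
import numpy as np
from scipy import stats

def partial_corr_pvalue(x, y, Z):
    """p-value for the partial correlation of x and y given covariates Z,
    via residualization and a Fisher z-test (Gaussian assumption)."""
    n = len(x)
    if Z.size:
        # Residualize x and y on Z (with an intercept) by least squares.
        Zb = np.column_stack([np.ones(n), Z])
        x = x - Zb @ np.linalg.lstsq(Zb, x, rcond=None)[0]
        y = y - Zb @ np.linalg.lstsq(Zb, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    k = Z.shape[1] if Z.size else 0
    z = np.arctanh(np.clip(r, -0.9999, 0.9999)) * np.sqrt(n - k - 3)
    return 2 * stats.norm.sf(abs(z))

def screen_proxies(X, flags, names, alpha=0.05):
    """Return attributes that remain dependent on the flag indicator
    after conditioning on all other observed attributes."""
    hits = []
    for j, name in enumerate(names):
        Z = np.delete(X, j, axis=1)  # condition on the other attributes
        p = partial_corr_pvalue(X[:, j], flags.astype(float), Z)
        if p < alpha:
            hits.append((name, p))
    return sorted(hits, key=lambda t: t[1])

# Toy usage: 'income' behaves as a proxy because flags track it.
rng = np.random.default_rng(0)
income = rng.normal(size=500)
age = rng.normal(size=500)
flags = (income + 0.5 * rng.normal(size=500) > 0).astype(int)
X = np.column_stack([income, age])
print(screen_proxies(X, flags, ["income", "age"]))
```

A real implementation along the paper's lines would need a CI test suited to the data types at hand (e.g., kernel- or discretization-based tests rather than partial correlation) and a principled choice of conditioning sets; the sketch above fixes both arbitrarily for brevity.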