
Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning.

Author information

Marconato Emanuele, Passerini Andrea, Teso Stefano

Affiliations

Dipartimento di Ingegneria e Scienza dell'Informazione, University of Trento, 38123 Trento, Italy.

Dipartimento di Informatica, University of Pisa, 56126 Pisa, Italy.

Publication information

Entropy (Basel). 2023 Nov 22;25(12):1574. doi: 10.3390/e25121574.

Abstract

Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from the data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is interpretable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine's representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
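The alignment notion announced above is cast in information-theoretic terms in the paper. Purely as an illustrative sketch, and not the paper's formal criterion, one common way to probe whether a learned code lines up with human-annotated concepts is to estimate the mutual information between each latent dimension and each concept: alignment without leakage would show up as a roughly one-to-one (permutation-like) pattern of high mutual information. All function and variable names below are invented for this sketch.

# Illustrative only: pairwise mutual information between latent dimensions
# of a learned representation and discrete human-annotated concepts.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_matrix(latents: np.ndarray, concepts: np.ndarray) -> np.ndarray:
    # latents: (n_samples, n_latents) continuous codes from the model.
    # concepts: (n_samples, n_concepts) discrete concept annotations.
    # Returns an (n_latents, n_concepts) matrix of estimated mutual information.
    n_latents, n_concepts = latents.shape[1], concepts.shape[1]
    mi = np.zeros((n_latents, n_concepts))
    for j in range(n_concepts):
        # Estimate MI between every latent dimension and concept j.
        mi[:, j] = mutual_info_classif(latents, concepts[:, j], random_state=0)
    return mi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    concepts = rng.integers(0, 2, size=(500, 3))      # toy binary concepts
    latents = concepts + 0.1 * rng.standard_normal((500, 3))  # well-aligned toy code
    print(np.round(mi_matrix(latents, concepts), 2))

On this toy data the printed matrix is diagonal-dominant, i.e., each concept is captured by a single latent; a representation suffering from leakage would instead spread the mutual information for one concept across several latent dimensions.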

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1eb5/10742865/1730e6c8659f/entropy-25-01574-g001.jpg
