IEEE Trans Neural Netw Learn Syst. 2015 Nov;26(11):2664-77. doi: 10.1109/TNNLS.2015.2389037. Epub 2015 Jan 23.
Many state-of-the-art data mining techniques introduce nonlinearities in their models to cope with complex data relationships effectively. Although such techniques are consistently ranked among the top classification techniques in terms of predictive power, their lack of transparency renders them useless in any domain where comprehensibility is important. Rule extraction algorithms remedy this by distilling, from complex models, comprehensible rule sets that explain how the classifications are made. This paper considers a new rule extraction technique based on active learning. The technique generates artificial data points around training data points for which the output score has low confidence, after which these artificial points are labeled by the black-box model. The main novelty of the proposed method is that it uses a pedagogical approach, making no architectural assumptions about the underlying model. It can therefore be applied to any black-box technique. Furthermore, it can generate any rule format, depending on the chosen underlying rule induction technique. In a large-scale empirical study, we demonstrate the validity of our technique for extracting trees and rules from artificial neural networks, support vector machines, and random forests on 25 data sets of varying size and dimensionality. Our results show that the generated rules not only explain the black-box models well (thereby facilitating the acceptance of such models), but that the proposed algorithm also performs significantly better than traditional rule induction techniques in terms of both accuracy and fidelity.
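The pipeline described in the abstract (generate artificial points near low-confidence training points, have the black box label them, then induce rules) can be illustrated with a minimal sketch. This is not the authors' algorithm: `black_box_score` is a hypothetical stand-in for a trained model, the 0.5 decision threshold and the uniform perturbation with a fixed `radius` are assumptions, and the single-threshold "stump" learner merely takes the place of whatever rule induction technique is actually chosen. All function names are illustrative.

```python
import math
import random


def black_box_score(x):
    """Hypothetical opaque model: returns a score in (0, 1) for a 2-D point."""
    return 1.0 / (1.0 + math.exp(-4.0 * (x[0] ** 2 + x[1] - 1.0)))


def label(x):
    """Class label assigned by the black box (the 0.5 threshold is an assumption)."""
    return 1 if black_box_score(x) >= 0.5 else 0


def generate_points(train, n_seeds, n_new, radius, rng):
    """Sample artificial points around the training points the black box is least sure of."""
    # Low confidence is taken here to mean a score close to the 0.5 threshold.
    seeds = sorted(train, key=lambda x: abs(black_box_score(x) - 0.5))[:n_seeds]
    return [
        tuple(v + rng.uniform(-radius, radius) for v in rng.choice(seeds))
        for _ in range(n_new)
    ]


def fit_stump(points, labels):
    """Toy rule inducer: best rule of the form 'class 1 if x[f] > t' or its mirror."""
    best_rule, best_hits = None, -1
    for f in range(len(points[0])):
        for t in sorted({p[f] for p in points}):
            for above_is_one in (True, False):
                hits = sum(
                    int((p[f] > t) == above_is_one) == y
                    for p, y in zip(points, labels)
                )
                if hits > best_hits:
                    best_rule, best_hits = (f, t, above_is_one), hits
    return best_rule


def apply_rule(rule, x):
    """Evaluate a stump rule (feature index, threshold, orientation) on one point."""
    f, t, above_is_one = rule
    return int((x[f] > t) == above_is_one)


def fidelity(rule, points):
    """Fraction of points on which the rule agrees with the black box."""
    return sum(apply_rule(rule, p) == label(p) for p in points) / len(points)


def extract_rule(seed=0, n_train=100, n_seeds=20, n_new=200, radius=0.2):
    """Build the extraction data set and fit a rule to the black-box labels."""
    rng = random.Random(seed)
    train = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n_train)]
    extra = generate_points(train, n_seeds, n_new, radius, rng)
    data = train + extra
    # Pedagogical step: every point, original or artificial, is labeled by the
    # black box, so the rule learner never inspects the model's internals.
    rule = fit_stump(data, [label(p) for p in data])
    return rule, data


rule, data = extract_rule()
print("rule:", rule, "fidelity on extraction data:", round(fidelity(rule, data), 3))
```

Because the sketch only ever calls `black_box_score`, swapping in a different model requires no change to the extraction code, which is the sense in which a pedagogical approach makes no architectural assumptions. Fidelity here measures agreement with the black box, whereas accuracy would compare the rule against true class labels.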