Shang Jingbo, Jiang Meng, Tong Wenzhu, Xiao Jinfeng, Peng Jian, Han Jiawei
Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA.
IEEE Trans Knowl Data Eng. 2018 Jul;30(7):1226-1239. doi: 10.1109/TKDE.2017.2757476. Epub 2017 Sep 28.
In the literature, two families of models have been proposed for prediction problems such as classification and regression. Simple models, such as generalized linear models, deliver only modest performance but are highly interpretable on a set of simple features. The other family, which includes tree-based models, organizes numerical, categorical, and high-dimensional features into a comprehensive structure that carries rich interpretable information about the data. In this paper, we propose a novel Discriminative Pattern-based Prediction framework (DPPred) that performs prediction tasks while combining the advantages of both families: effectiveness and interpretability. Specifically, DPPred adopts concise discriminative patterns taken from the prefix paths that run from the root to the leaf nodes of tree-based models. It then selects a limited number of useful discriminative patterns by searching for the most effective combination of patterns with which to fit generalized linear models. Extensive experiments show that in many scenarios DPPred achieves accuracy competitive with the state of the art while offering valuable interpretability to developers and experts. In particular, in a case study on a clinical application dataset, DPPred outperforms the baselines using only 40 concise discriminative patterns out of a potentially exponentially large set of patterns.
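The two steps described in the abstract (reading patterns off the root-to-node prefix paths of a tree, then keeping a small set of patterns as features for a linear model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Node` class, the toy `age`/`bp` tree and data, and the function names are hypothetical, and because the abstract does not specify the search procedure, the selection step substitutes a simple greedy forward stagewise least-squares fit for the generalized linear model search.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A condition is (feature name, operator, threshold); a pattern is a conjunction of conditions.
Condition = Tuple[str, str, float]
Pattern = Tuple[Condition, ...]


@dataclass
class Node:
    """Hypothetical binary decision-tree node; a leaf has feature=None."""
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when x[feature] < threshold
    right: Optional["Node"] = None  # branch taken when x[feature] >= threshold


def prefix_patterns(node: Node, prefix: Pattern = ()) -> List[Pattern]:
    """Collect every non-empty prefix path from the root as one pattern."""
    patterns: List[Pattern] = [prefix] if prefix else []
    if node.feature is not None:  # internal node: extend the path down both branches
        patterns += prefix_patterns(node.left, prefix + ((node.feature, "<", node.threshold),))
        patterns += prefix_patterns(node.right, prefix + ((node.feature, ">=", node.threshold),))
    return patterns


def matches(pattern: Pattern, x: Dict[str, float]) -> bool:
    """True if instance x satisfies every condition in the pattern."""
    return all((x[f] < t) if op == "<" else (x[f] >= t) for f, op, t in pattern)


def to_pattern_features(patterns: List[Pattern], rows: List[Dict[str, float]]) -> List[List[int]]:
    """Map each instance to a binary vector: one indicator per pattern."""
    return [[1 if matches(p, x) else 0 for p in patterns] for x in rows]


def select_patterns(features: List[List[int]], y: List[float], k: int) -> List[int]:
    """Greedy stand-in for pattern selection (an assumption, not the paper's search):
    repeatedly add the pattern column that most reduces the squared residual error."""
    mean = sum(y) / len(y)
    residual = [v - mean for v in y]
    chosen: List[int] = []
    for _ in range(k):
        best_j, best_gain, best_coef = None, 0.0, 0.0
        for j in range(len(features[0])):
            if j in chosen:
                continue
            col = [row[j] for row in features]
            cc = sum(c * c for c in col)
            if cc == 0:
                continue  # pattern matches no instance
            cr = sum(c * r for c, r in zip(col, residual))
            gain = cr * cr / cc  # error reduction from a one-variable least-squares fit
            if gain > best_gain + 1e-12:
                best_j, best_gain, best_coef = j, gain, cr / cc
        if best_j is None:
            break  # no remaining pattern improves the fit
        chosen.append(best_j)
        residual = [r - best_coef * row[best_j] for r, row in zip(residual, features)]
    return chosen


# Toy tree: split on age at 50, then on bp at 120 in the younger branch.
tree = Node("age", 50.0, Node("bp", 120.0, Node(), Node()), Node())
rows = [{"age": 40, "bp": 110}, {"age": 40, "bp": 130},
        {"age": 60, "bp": 110}, {"age": 60, "bp": 130}]
labels = [0.0, 1.0, 0.0, 0.0]

patterns = prefix_patterns(tree)
features = to_pattern_features(patterns, rows)
top = select_patterns(features, labels, k=2)
for j in top:
    print(" AND ".join(f"{f} {op} {t}" for f, op, t in patterns[j]))
```

Each selected pattern stays readable as a short conjunction of conditions, which is the property the abstract credits for interpretability. A fuller version would draw patterns from many trees and fit a proper generalized linear model over the selected indicators.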