Lucas Mentch, Giles Hooker
Department of Statistical Science, Cornell University.
J Comput Graph Stat. 2017;26(3):589-597. doi: 10.1080/10618600.2016.1256817. Epub 2017 Apr 17.
While statistical learning methods have proved powerful tools for predictive modeling, the black-box nature of the models they produce can severely limit their interpretability and the ability to conduct formal inference. However, the natural structure of ensemble learners like bagged trees and random forests has been shown to admit desirable asymptotic properties when base learners are built with proper subsamples. In this work, we demonstrate that by defining an appropriate grid structure on the covariate space, we may carry out formal hypothesis tests for both variable importance and underlying additive model structure. To our knowledge, these tests represent the first statistical tools for investigating the underlying regression structure in a context such as random forests. We develop notions of total and partial additivity and further demonstrate that testing can be carried out at no additional computational cost by estimating the variance within the process of constructing the ensemble. Furthermore, we propose a novel extension of these testing procedures utilizing random projections in order to allow for computationally efficient testing procedures that retain high power even when the grid size is much larger than that of the training set.
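As a rough illustration of what total additivity means on a grid, the sketch below computes the interaction contrasts for predictions laid out on a two-covariate grid. It is not the authors' implementation: the function name `additivity_contrasts`, the toy grid, and the two example regression functions are all assumptions made for this example. If the underlying function has the form F(x1, x2) = f1(x1) + f2(x2), each contrast F_ij - (row mean)_i - (column mean)_j + (grand mean) is zero. A formal test would also need the estimated covariance of the grid predictions, which the abstract says can be obtained while the ensemble is built; that step is omitted here.

```python
from statistics import fmean


def additivity_contrasts(grid_preds):
    """Return interaction contrasts for predictions on a 2-D grid.

    grid_preds[i][j] is a prediction at the grid point (x1_i, x2_j).
    Under total additivity, F(x1, x2) = f1(x1) + f2(x2), every contrast
    F_ij - row_mean_i - col_mean_j + grand_mean equals zero.
    (Hypothetical helper, written only for this illustration.)
    """
    n_rows = len(grid_preds)
    n_cols = len(grid_preds[0])
    row_means = [fmean(row) for row in grid_preds]
    col_means = [
        fmean([grid_preds[i][j] for i in range(n_rows)]) for j in range(n_cols)
    ]
    # Valid as the overall mean because every row has the same length.
    grand_mean = fmean(row_means)
    return [
        [
            grid_preds[i][j] - row_means[i] - col_means[j] + grand_mean
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


def max_abs(matrix):
    """Largest absolute entry of a nested list."""
    return max(abs(value) for row in matrix for value in row)


# Toy grid values (assumed for illustration, not taken from the paper).
x1_grid = [0.0, 0.5, 1.0]
x2_grid = [0.0, 1.0]

# An additive surface: no interaction between x1 and x2.
additive = [[2 * a + b * b for b in x2_grid] for a in x1_grid]
# A surface with a pure interaction term.
interactive = [[a * b for b in x2_grid] for a in x1_grid]

additive_resid = additivity_contrasts(additive)
interactive_resid = additivity_contrasts(interactive)

print("additive surface, largest contrast:", max_abs(additive_resid))
print("interactive surface, largest contrast:", max_abs(interactive_resid))
```

In practice the grid values would be ensemble predictions rather than exact function values, so the contrasts would be nonzero by chance alone. That is why they would have to be judged against their estimated sampling variability rather than against exact zero.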