Neufeld Anna C, Gao Lucy L, Witten Daniela M
Department of Statistics, University of Washington, Seattle, WA 98195, USA.
Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada.
J Mach Learn Res. 2022;23.
We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
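The claim that a naive approach loses Type 1 error rate control can be illustrated with a small simulation. The sketch below is not the authors' method or code. It assumes a single feature, observations already ordered by that feature, a depth-one tree (one CART-style split chosen to minimize within-node squared error), pure-noise responses with known variance 1, and a two-sided z-test that ignores how the split was chosen. All function names and parameter values are hypothetical.

```python
import math
import random


def best_split(y, min_leaf=5):
    """Return k such that splitting into y[:k] and y[k:] minimizes within-leaf SSE.

    Minimizing SSE is equivalent to maximizing
    left_sum**2 / k + right_sum**2 / (n - k), which is what the loop tracks.
    """
    n = len(y)
    total = sum(y)
    best_k, best_gain = None, -math.inf
    left_sum = sum(y[:min_leaf - 1])
    for k in range(min_leaf, n - min_leaf + 1):
        left_sum += y[k - 1]  # left_sum now equals sum(y[:k])
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k


def naive_p_value(y, k, sigma=1.0):
    """Two-sided z-test for equal leaf means, ignoring that k was chosen from y."""
    n = len(y)
    diff = sum(y[:k]) / k - sum(y[k:]) / (n - k)
    z = diff / (sigma * math.sqrt(1 / k + 1 / (n - k)))
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def naive_rejection_rate(n_sims=500, n=50, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Null model: the response is pure noise, so both leaves share mean 0.
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        k = best_split(y)
        if naive_p_value(y, k) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_rejection_rate())
```

Because the split is selected to make the two leaf means as different as possible, the naive test should reject far more often than the nominal level even though no true difference exists. A selective test of the kind proposed in the abstract would instead condition on the event that this particular tree was estimated from the data.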