Department of Decision Sciences, HEC Montreal, Canada.
Stat Methods Med Res. 2020 Jan;29(1):205-229. doi: 10.1177/0962280219829885. Epub 2019 Feb 21.
The classical and most commonly used approach to building prediction intervals is the parametric approach. However, its main drawback is that its validity and performance depend heavily on the assumed functional link between the covariates and the response. This research investigates new methods that improve the performance of prediction intervals with random forests. Two aspects are explored: the method used to build the forest and the method used to build the prediction interval. Four methods to build the forest are investigated: three from the classification and regression tree (CART) paradigm, plus the transformation forest method. For CART forests, in addition to the default least-squares splitting rule, two alternative splitting criteria are investigated. We also present and evaluate the performance of five flexible methods for constructing prediction intervals. Combining the four forest-building methods with the five interval methods yields 20 distinct method variations. To reliably attain the desired confidence level, we include a calibration procedure performed on the out-of-bag information provided by the forest. The 20 method variations are thoroughly investigated and compared to five alternative methods through simulation studies and in real data settings. The results show that the proposed methods are very competitive. They outperform commonly used methods both in simulation settings and with real data.
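The abstract does not spell out the calibration procedure, so the following is only a minimal sketch of one way such a step could work, under stated assumptions. It assumes that each training observation comes with a hypothetical "bag" of responses derived from the trees for which it was out-of-bag, that an interval is formed from equal-tailed empirical quantiles of that bag, and that calibration means raising the nominal level on a grid until the out-of-bag coverage reaches the target. The names `quantile`, `interval`, `coverage`, and `calibrate`, as well as the toy data, are illustrative and do not come from the paper.

```python
import math

def quantile(values, q):
    """Smallest order statistic with at least a fraction q of the values at or below it."""
    ordered = sorted(values)
    k = min(len(ordered), max(1, math.ceil(q * len(ordered))))
    return ordered[k - 1]

def interval(bag, level):
    """Equal-tailed interval at a nominal level, taken from a bag of responses."""
    alpha = 1.0 - level
    return quantile(bag, alpha / 2), quantile(bag, 1 - alpha / 2)

def coverage(y, bags, level):
    """Fraction of observed responses that fall inside their own out-of-bag interval."""
    hits = 0
    for yi, bag in zip(y, bags):
        lo, hi = interval(bag, level)
        hits += lo <= yi <= hi
    return hits / len(y)

def calibrate(y, bags, target=0.90, step=0.01):
    """Smallest nominal level on the grid whose out-of-bag coverage reaches the target."""
    level = target
    while level < 1.0:
        if coverage(y, bags, level) >= target:
            return level
        level = round(level + step, 10)
    return 1.0  # fall back to the full range of each bag

# Toy data (made up for illustration): the bags are slightly too narrow,
# so intervals at the nominal 80% level cover fewer than 80% of the responses.
offsets = (0, 0.5, -0.5, 1.0, -1.0, 1.5, -1.5, 1.7, -1.7, 1.9)
bags = [[i + d / 10 for d in range(-20, 21)] for i in range(20)]
y = [i + offsets[i % 10] for i in range(20)]

level = calibrate(y, bags, target=0.80)
print("nominal level used:", level)
print("out-of-bag coverage at that level:", coverage(y, bags, level))
```

In this sketch the calibrated nominal level, not the target itself, would then be used when building intervals for new observations. A real implementation would obtain the bags from a fitted forest rather than from hand-made lists.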