Smith Helen L, Biggs Patrick J, French Nigel P, Smith Adam N H, Marshall Jonathan C
School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand.
School of Food Technology and Natural Sciences, Massey University, Palmerston North, New Zealand.
PeerJ Comput Sci. 2024 Nov 18;10:e2445. doi: 10.7717/peerj-cs.2445. eCollection 2024.
Performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations which are OOB when a tree is trained may serve as a test set for that tree, and predictions from the OOB observations are used to calculate OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method, and when the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree-based methods that utilise a target-based encoding method.
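The mechanism described above can be illustrated with a small simulation. The following is a minimal sketch and not the authors' experiment: it assumes a single categorical predictor with many levels that carries no information about a binary class (so the true misclassification rate is 0.5), a simple mean-of-target encoding, and bagged one-split stumps as a stand-in for a full random forest. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random

def target_encode(cats, ys):
    """Map each category level to the mean of y over the given observations."""
    sums, counts = {}, {}
    for c, y in zip(cats, ys):
        sums[c] = sums.get(c, 0) + y
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def fit_stump(xs, ys, thresholds=(0.25, 0.75)):
    """Pick the threshold t with the lowest in-bag error for 'predict 1 if x > t'."""
    def n_errors(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, ys))
    return min(thresholds, key=n_errors)

def simulate(n_train=400, n_test=2000, n_levels=200, n_trees=50, seed=0):
    """Return (OOB error, test error) for bagged stumps on a target-encoded predictor."""
    rng = random.Random(seed)

    # Assumption: labels are independent of the category, so no model can beat 0.5.
    cats = [i % n_levels for i in range(n_train)]
    ys = [rng.randint(0, 1) for _ in range(n_train)]

    # Target-based encoding done BEFORE bagging: every label contributes,
    # including the labels of observations that are later out-of-bag.
    encoding = target_encode(cats, ys)
    xs = [encoding[c] for c in cats]

    votes = [0] * n_train   # OOB votes for class 1, per observation
    n_oob = [0] * n_train   # number of stumps for which the observation was OOB
    stumps = []
    for _ in range(n_trees):
        bag = [rng.randrange(n_train) for _ in range(n_train)]  # bootstrap sample
        t = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        stumps.append(t)
        in_bag = set(bag)
        for i in range(n_train):
            if i not in in_bag:
                votes[i] += xs[i] > t
                n_oob[i] += 1

    # OOB error: majority vote over the stumps that did not see the observation.
    scored = [i for i in range(n_train) if n_oob[i] > 0]
    oob_err = sum((2 * votes[i] > n_oob[i]) != bool(ys[i]) for i in scored) / len(scored)

    # Separate test set: same levels, fresh labels, encoded with the training map.
    test_cats = [i % n_levels for i in range(n_test)]
    test_ys = [rng.randint(0, 1) for _ in range(n_test)]
    wrong = 0
    for c, y in zip(test_cats, test_ys):
        v = sum(encoding[c] > t for t in stumps)
        wrong += (2 * v > n_trees) != bool(y)
    return oob_err, wrong / n_test

if __name__ == "__main__":
    oob_err, test_err = simulate()
    print(f"OOB error:  {oob_err:.3f}")
    print(f"Test error: {test_err:.3f}")
```

Under these assumptions the OOB error should come out well below 0.5 while the error on the separate test set stays close to 0.5. The gap arises because each encoded value already contains the observation's own label, so an observation that is out-of-bag for a given stump is not truly unseen.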