Out of (the) bag: encoding categorical predictors impacts out-of-bag samples.

Authors

Smith Helen L, Biggs Patrick J, French Nigel P, Smith Adam N H, Marshall Jonathan C

Affiliations

School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand.

School of Food Technology and Natural Sciences, Massey University, Palmerston North, New Zealand.

Publication information

PeerJ Comput Sci. 2024 Nov 18;10:e2445. doi: 10.7717/peerj-cs.2445. eCollection 2024.

Abstract

Performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations which are OOB when a tree is trained may serve as a test set for that tree, and predictions for the OOB observations are used to calculate OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method, and when the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree-based methods that utilise a target-based encoding method.
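
The leakage described in the abstract can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not the authors' experimental setup: the predictor is a pure-noise categorical variable, the target is a fair coin flip (so the true misclassification rate of any classifier is 0.5), and a fixed threshold rule on the encoded value stands in for a trained tree. The function names `target_encode` and `oob_error` are hypothetical and introduced only for this example.

```python
import random
from collections import defaultdict


def target_encode(levels, targets, rows):
    """Map each level to the mean target over `rows`; also return the overall mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i in rows:
        sums[levels[i]] += targets[i]
        counts[levels[i]] += 1
    overall = sum(targets[i] for i in rows) / len(rows)
    return {lv: sums[lv] / counts[lv] for lv in sums}, overall


def oob_error(levels, targets, n_bags, encode_before_bagging, rng):
    """Pooled OOB misclassification rate of a threshold rule on the encoded value."""
    n = len(targets)
    # Encoding computed once on ALL rows, i.e. prior to bagging.
    full_enc, full_mean = target_encode(levels, targets, range(n))
    wrong = total = 0
    for _ in range(n_bags):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag_set = set(in_bag)
        if encode_before_bagging:
            enc, fallback = full_enc, full_mean
        else:
            # Encoding recomputed from in-bag rows only, so OOB targets are never seen.
            enc, fallback = target_encode(levels, targets, in_bag)
        for i in range(n):
            if i in in_bag_set:
                continue  # only out-of-bag rows are scored
            # Toy stand-in for a tree: predict class 1 when the encoded value is >= 0.5.
            pred = 1 if enc.get(levels[i], fallback) >= 0.5 else 0
            wrong += pred != targets[i]
            total += 1
    return wrong / total


rng = random.Random(0)
n, n_levels = 2000, 500  # about four observations per level (assumed for illustration)
levels = [rng.randrange(n_levels) for _ in range(n)]   # noise predictor
targets = [rng.randint(0, 1) for _ in range(n)]        # target independent of predictor

leaky_oob = oob_error(levels, targets, 50, True, rng)
honest_oob = oob_error(levels, targets, 50, False, rng)
print(f"OOB error, encoding before bagging: {leaky_oob:.3f}")
print(f"OOB error, encoding inside each bag: {honest_oob:.3f}")
```

Because each OOB observation's own target contributes to its encoded value when the encoding is computed before bagging, the first estimate should fall noticeably below the true rate of 0.5, while the in-bag encoding should stay close to it. A held-out test set, as the authors recommend, avoids the issue regardless of where the encoding is fitted.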

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/107b/11623134/3991dd0e06e2/peerj-cs-10-2445-g001.jpg
