Out of (the) bag-encoding categorical predictors impacts out-of-bag samples.

Authors

Smith Helen L, Biggs Patrick J, French Nigel P, Smith Adam N H, Marshall Jonathan C

Affiliations

School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand.

School of Food Technology and Natural Sciences, Massey University, Palmerston North, New Zealand.

Publication information

PeerJ Comput Sci. 2024 Nov 18;10:e2445. doi: 10.7717/peerj-cs.2445. eCollection 2024.

DOI:10.7717/peerj-cs.2445

PMID:39650463

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11623134/

Abstract

Performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations that are OOB when a tree is trained can serve as a test set for that tree, and predictions for these OOB observations are used to calculate OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method, and when the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree-based methods that utilise a target-based encoding method.
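The leakage described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or experimental setup: it assumes simulated data in which a high-cardinality categorical predictor is independent of a binary target, uses a simple mean-of-target encoding fitted on all training rows before bagging, and replaces the random forest with a bagged ensemble of one-split stumps. The function name `simulate` and all parameters are hypothetical.

```python
import random
from collections import defaultdict


def simulate(n=1000, n_levels=500, n_trees=50, seed=0):
    """Return (oob_error, test_error) for bagged stumps on a target-encoded noise predictor."""
    rng = random.Random(seed)

    def draw(size):
        # The categorical level and the binary target are independent,
        # so no classifier can truly beat a 0.5 misclassification rate.
        return [(rng.randrange(n_levels), rng.randrange(2)) for _ in range(size)]

    train, test = draw(n), draw(n)

    # Target-based encoding fitted on ALL training rows, i.e. before bagging.
    sums, counts = defaultdict(int), defaultdict(int)
    for level, y in train:
        sums[level] += y
        counts[level] += 1
    prior = sum(y for _, y in train) / n

    def encode(level):
        # Levels unseen in training fall back to the overall target mean.
        seen = counts.get(level, 0)
        return sums[level] / seen if seen else prior

    x_train = [encode(level) for level, _ in train]
    y_train = [y for _, y in train]
    x_test = [encode(level) for level, _ in test]
    y_test = [y for _, y in test]

    oob_votes, oob_counts = [0] * n, [0] * n
    test_votes = [0] * n

    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        # Stand-in for a tree: a stump that predicts 1 when the encoded value
        # exceeds the threshold with the lowest in-bag error.
        threshold = min(
            (0.25, 0.5, 0.75),
            key=lambda t: sum((x_train[i] > t) != y_train[i] for i in in_bag),
        )
        for i in set(range(n)) - set(in_bag):  # rows that are OOB for this stump
            oob_votes[i] += x_train[i] > threshold
            oob_counts[i] += 1
        for j in range(n):
            test_votes[j] += x_test[j] > threshold

    # Majority vote over the stumps for which each row was OOB.
    oob_rows = [i for i in range(n) if oob_counts[i] > 0]
    oob_error = sum(
        (2 * oob_votes[i] > oob_counts[i]) != y_train[i] for i in oob_rows
    ) / len(oob_rows)
    # Majority vote over all stumps on the separate test set.
    test_error = sum(
        (2 * test_votes[j] > n_trees) != y_test[j] for j in range(n)
    ) / n
    return oob_error, test_error


if __name__ == "__main__":
    oob_error, test_error = simulate()
    print(f"OOB error:  {oob_error:.3f}")
    print(f"Test error: {test_error:.3f}")
```

Because each encoded value already contains the row's own target, a row that is OOB for a given stump is still not independent of the predictor it is scored on. In this toy setup the OOB error should therefore come out noticeably below the error on the separate test set, even though the predictor carries no real signal.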


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/107b/11623134/3991dd0e06e2/peerj-cs-10-2445-g001.jpg



