Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection.

Authors

Adler Afek Ilay, Painsky Amichai

Affiliation

The Industrial Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel.

Publication

Entropy (Basel). 2022 May 13;24(5):687. doi: 10.3390/e24050687.

Abstract

Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We demonstrate that although these implementations achieve highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework on a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining a comparable level of prediction accuracy.
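The cardinality bias the abstract describes is easy to reproduce. The sketch below is my own minimal illustration, not the authors' implementation: it searches categorical splits CART-style (order categories by mean target, scan all thresholds) using Gini gain, on features that are pure noise. The 100-level feature earns a much larger training gain than the binary one simply because it offers more candidate splits; evaluating the chosen split on held-out data, in the spirit of the paper's cross-validated base learners, deflates that spurious gain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)        # random labels: every feature below is pure noise
x_bin = rng.integers(0, 2, n)    # binary categorical feature (cardinality 2)
x_cat = rng.integers(0, 100, n)  # high-cardinality categorical feature (100 levels)

def gini(y):
    p = y.mean() if len(y) else 0.0
    return 2.0 * p * (1.0 - p)

def gain(y, left_mask):
    """Gini impurity reduction of a binary split defined by left_mask."""
    nl = int(left_mask.sum())
    nr = len(y) - nl
    if nl == 0 or nr == 0:
        return 0.0
    return gini(y) - (nl * gini(y[left_mask]) + nr * gini(y[~left_mask])) / len(y)

def best_categorical_split(x, y):
    """CART-style search: order categories by mean target, scan all thresholds."""
    cats = np.unique(x)
    order = cats[np.argsort([y[x == c].mean() for c in cats])]
    best_g, best_left = 0.0, order[:0]
    for k in range(1, len(order)):
        g = gain(y, np.isin(x, order[:k]))
        if g > best_g:
            best_g, best_left = g, order[:k]
    return best_g, best_left

g_bin, _ = best_categorical_split(x_bin, y)
g_cat, _ = best_categorical_split(x_cat, y)
print(f"training gain, 2 levels:   {g_bin:.4f}")
print(f"training gain, 100 levels: {g_cat:.4f}")  # inflated by the larger search space

# Cross-validated evaluation: choose the split on one half of the data and
# score its gain on the held-out half, where a noise split cannot look good.
half = n // 2
_, left_cv = best_categorical_split(x_cat[:half], y[:half])
g_cv = gain(y[half:], np.isin(x_cat[half:], left_cv))
print(f"held-out gain, 100 levels: {g_cv:.4f}")
```

Since impurity-based feature importance sums exactly these gains over a tree ensemble, the inflated training gain of the high-cardinality noise feature translates directly into inflated FI, which is the flaw the paper's CV base learners address.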


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/71c5/9140774/cc674c26a110/entropy-24-00687-g003.jpg
