
Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection.

Authors

Adler Afek Ilay, Painsky Amichai

Affiliation

The Industrial Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel.

Publication

Entropy (Basel). 2022 May 13;24(5):687. doi: 10.3390/e24050687.

Abstract

Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We demonstrate that although these implementations achieve highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining roughly the same level of prediction accuracy.
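The cardinality bias the abstract refers to can be seen directly in the in-sample split gain that tree-based learners optimize: on labels that are pure noise, a categorical feature with many levels yields a larger apparent impurity reduction than one with few levels, simply because more levels mean more chances to fit chance patterns. The sketch below illustrates this effect with a simple Gini-gain computation; it is an independent illustration of the phenomenon, not the authors' method, and the sample sizes and cardinalities are arbitrary choices.

```python
import random

def gini(labels):
    """Empirical Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def multiway_gain(y, groups):
    """In-sample impurity reduction from partitioning y by categorical level."""
    n = len(y)
    buckets = {}
    for yi, g in zip(y, groups):
        buckets.setdefault(g, []).append(yi)
    child = sum(len(b) / n * gini(b) for b in buckets.values())
    return gini(y) - child  # always >= 0 in-sample, by concavity of Gini

random.seed(0)
n, trials = 200, 200
avg = {}
for k in (2, 50):  # low- vs high-cardinality categorical feature
    total = 0.0
    for _ in range(trials):
        y = [random.randint(0, 1) for _ in range(n)]  # pure-noise labels
        g = [random.randrange(k) for _ in range(n)]   # random categorical, k levels
        total += multiway_gain(y, g)
    avg[k] = total / trials

print(avg)  # the 50-level feature shows a much larger average apparent gain
```

Since impurity-based FI is accumulated from exactly these gains, the high-cardinality feature inherits inflated importance even when it carries no signal; scoring candidate splits on held-out data, as in the CV scheme the paper proposes, removes this in-sample advantage.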


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/71c5/9140774/cc674c26a110/entropy-24-00687-g003.jpg
