Adler Afek Ilay, Painsky Amichai
The Industrial Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel.
Entropy (Basel). 2022 May 13;24(5):687. doi: 10.3390/e24050687.
Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite their popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We demonstrate that although these implementations achieve highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining essentially the same level of prediction accuracy.
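The cardinality bias the abstract refers to can be illustrated with a minimal sketch (a toy illustration under our own assumptions, not the paper's method): for a pure-noise categorical feature, the number of candidate "one level vs. rest" splits grows with the feature's cardinality, so the best split gain found by chance inflates, and impurity-based feature importance with it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)  # binary labels, independent of every feature


def gini(labels):
    """Gini impurity of a binary label array."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)


def best_split_gain(x, y):
    """Largest impurity decrease over all 'one level vs. rest' splits of x."""
    base = gini(y)
    gains = []
    for level in np.unique(x):
        left, right = y[x == level], y[x != level]
        w = len(left) / len(y)
        gains.append(base - (w * gini(left) + (1.0 - w) * gini(right)))
    return max(gains)


# A noise feature with more levels offers more chances for a spuriously
# good split, so its apparent gain tends to grow with cardinality k.
for k in (2, 10, 100):
    x = rng.integers(0, k, n)  # pure-noise categorical with k levels
    print(f"cardinality {k:3d}: best chance gain = {best_split_gain(x, y):.4f}")
```

Standard tree learners pick splits by exactly this kind of gain maximization, which is why a high-cardinality noise variable can crowd out genuinely informative features in impurity-based FI.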