Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection.

Authors

Adler Afek Ilay, Painsky Amichai

Affiliation

The Industrial Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel.

Publication

Entropy (Basel). 2022 May 13;24(5):687. doi: 10.3390/e24050687.

Abstract

Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We demonstrate that although these implementations achieve highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework on a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining a comparable level of prediction accuracy.
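The cardinality bias the abstract describes is easy to reproduce. The sketch below is my own minimal illustration, not the authors' implementation: it searches categorical splits CART-style (order categories by mean target, scan all thresholds) using Gini gain, on features that are pure noise. The 100-level feature earns a much larger training gain than the binary one simply because it offers more candidate splits; evaluating the chosen split on held-out data, in the spirit of the paper's cross-validated base learners, deflates that spurious gain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)        # random labels: every feature below is pure noise
x_bin = rng.integers(0, 2, n)    # binary categorical feature (cardinality 2)
x_cat = rng.integers(0, 100, n)  # high-cardinality categorical feature (100 levels)

def gini(y):
    p = y.mean() if len(y) else 0.0
    return 2.0 * p * (1.0 - p)

def gain(y, left_mask):
    """Gini impurity reduction of a binary split defined by left_mask."""
    nl = int(left_mask.sum())
    nr = len(y) - nl
    if nl == 0 or nr == 0:
        return 0.0
    return gini(y) - (nl * gini(y[left_mask]) + nr * gini(y[~left_mask])) / len(y)

def best_categorical_split(x, y):
    """CART-style search: order categories by mean target, scan all thresholds."""
    cats = np.unique(x)
    order = cats[np.argsort([y[x == c].mean() for c in cats])]
    best_g, best_left = 0.0, order[:0]
    for k in range(1, len(order)):
        g = gain(y, np.isin(x, order[:k]))
        if g > best_g:
            best_g, best_left = g, order[:k]
    return best_g, best_left

g_bin, _ = best_categorical_split(x_bin, y)
g_cat, _ = best_categorical_split(x_cat, y)
print(f"training gain, 2 levels:   {g_bin:.4f}")
print(f"training gain, 100 levels: {g_cat:.4f}")  # inflated by the larger search space

# Cross-validated evaluation: choose the split on one half of the data and
# score its gain on the held-out half, where a noise split cannot look good.
half = n // 2
_, left_cv = best_categorical_split(x_cat[:half], y[:half])
g_cv = gain(y[half:], np.isin(x_cat[half:], left_cv))
print(f"held-out gain, 100 levels: {g_cv:.4f}")
```

Since impurity-based feature importance sums exactly these gains over a tree ensemble, the inflated training gain of the high-cardinality noise feature translates directly into inflated FI, which is the flaw the paper's CV base learners address.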


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/71c5/9140774/cc674c26a110/entropy-24-00687-g003.jpg
