
Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2017 Nov;39(11):2142-2153. doi: 10.1109/TPAMI.2016.2636831. Epub 2016 Dec 7.

Abstract

Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree-building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a splitting framework that uses leave-one-out (LOO) cross-validation (CV) to select the splitting variable, then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our splitting approach significantly improves the performance of both single-tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which, under reasonable assumptions, does not substantially increase the overall computational complexity compared to CART for two-class classification.
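The two-stage rule described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it handles only numeric variables (the paper's emphasis on many-category categorical variables is omitted), uses a naive O(n²)-per-variable LOO loop rather than the authors' efficient algorithm, and all function names (`select_split_variable`, `loo_error`, `best_split`) are assumptions. Each candidate variable is scored by the LOO CV misclassification rate of its best CART-style (Gini) split, and the variable with the lowest LOO error is selected for the actual split.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary (0/1) label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """CART-style threshold minimizing weighted Gini impurity (None if x is constant)."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:           # exclude the max so the right child is non-empty
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

def loo_error(x, y):
    """LOO CV misclassification rate of best_split applied to a single variable."""
    n, errors = len(y), 0
    for i in range(n):
        mask = np.arange(n) != i          # hold out observation i
        t = best_split(x[mask], y[mask])
        if t is None:                     # no valid split: predict the majority class
            pred = round(float(np.mean(y[mask])))
        else:                             # predict the majority class on i's side of the split
            side = y[mask][x[mask] <= t] if x[i] <= t else y[mask][x[mask] > t]
            pred = round(float(np.mean(side)))
        errors += int(pred != y[i])
    return errors / n

def select_split_variable(X, y):
    """Pick the column with the lowest LOO CV error; split it with the regular rule."""
    errs = [loo_error(X[:, j], y) for j in range(X.shape[1])]
    return int(np.argmin(errs))
```

On a toy matrix where only the first column separates the two classes, `select_split_variable` picks column 0, while a noise column accrues a high LOO error and is passed over — which is precisely the safeguard the abstract claims for many-category variables: a variable is chosen only if it improves held-out prediction, not merely in-sample impurity.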

