Splitting on categorical predictors in random forests.

Author Information

Wright Marvin N, König Inke R

Affiliations

Leibniz Institute for Prevention Research and Epidemiology-BIPS, Bremen, Germany.

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.

Publication Information

PeerJ. 2019 Feb 7;7:e6339. doi: 10.7717/peerj.6339. eCollection 2019.

Abstract

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k-1) - 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k - 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend to use this approach as the default in RFs.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10a7/6368971/cb86aad26689/peerj-07-6339-g001.jpg
