Radiation Epidemiology Branch, National Cancer Institute, Bethesda, MD, 20892-9778, USA.
Radiation Epidemiology Branch, Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Bethesda, MD, 20892-9778, USA.
Sci Rep. 2022 Sep 6;12(1):15113. doi: 10.1038/s41598-022-19281-7.
Random forests are a popular type of machine learning model; unlike some other machine learning models, they are relatively robust to overfitting and adequately capture non-linear relationships between an outcome of interest and multiple independent variables. There are relatively few adjustable hyperparameters in the standard random forest models, among them the minimum size of the terminal nodes on each tree. The usual stopping rule, as proposed by Breiman, stops tree expansion by limiting the size of the parent nodes, so that a node cannot be split if it has fewer than a specified number of observations. Recently, an alternative stopping criterion has been proposed, stopping tree expansion so that all terminal nodes have at least a minimum number of observations. The present paper proposes three generalisations of this idea, limiting the growth in regression random forests based on the variance, range, or intercentile range. The new approaches are applied to diabetes data obtained from the National Health and Nutrition Examination Survey and to four other datasets (Tasmanian Abalone data, Boston Housing crime rate data, Los Angeles ozone concentration data, MIT servo data). The empirical analyses presented herein demonstrate that the new stopping rules yield mean square prediction error that is competitive with that of standard random forest models. In general, use of the intercentile range statistic to control tree expansion yields much less variation in mean square prediction error, and mean square prediction error is also closer to the optimal. The Fortran code developed is provided in the Supplementary Material.
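The stopping rules described above can be illustrated with a minimal Python sketch. This is not the Fortran code from the Supplementary Material. The function names `can_split` and `split_respects_leaf_size`, the rule labels, the direction of the comparisons, and the default 25th/75th centiles are all assumptions made for illustration; the abstract does not specify how the variance, range, or intercentile range thresholds are applied.

```python
import numpy as np


def can_split(y, rule, threshold, centiles=(25.0, 75.0)):
    """Return True if a node holding outcome values y may still be split.

    Hypothetical helper. Assumption: for the three spread-based rules, a node
    stops expanding once the spread of its outcomes is at or below `threshold`.
    """
    y = np.asarray(y, dtype=float)
    if y.size < 2:
        # A node with fewer than two observations cannot be divided.
        return False
    if rule == "parent_size":
        # Usual rule: a node with fewer than `threshold` observations is not split.
        return bool(y.size >= threshold)
    if rule == "variance":
        return bool(np.var(y) > threshold)
    if rule == "range":
        return bool(y.max() - y.min() > threshold)
    if rule == "intercentile":
        lower, upper = np.percentile(y, centiles)
        return bool(upper - lower > threshold)
    raise ValueError(f"unknown rule: {rule!r}")


def split_respects_leaf_size(left_y, right_y, min_leaf):
    """Terminal-node rule: both children must keep at least min_leaf observations."""
    return len(left_y) >= min_leaf and len(right_y) >= min_leaf
```

In a tree-growing loop, a check like `can_split(y_node, "intercentile", threshold)` would be evaluated before searching for a split, whereas `split_respects_leaf_size` would be evaluated for each candidate split, since the terminal-node rule constrains the children rather than the parent.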