Barreñada Lasai, Dhiman Paula, Timmerman Dirk, Boulesteix Anne-Laure, Van Calster Ben
Department of Development and Regeneration, KU Leuven, Leuven, Belgium.
Leuven Unit for Health Technology Assessment Research (LUHTAR), KU Leuven, Leuven, Belgium.
Diagn Progn Res. 2024 Sep 27;8(1):14. doi: 10.1186/s41512-024-00177-1.
Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing the data space in three real-world case studies and (2) conducting a simulation study.
For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).
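The simulation design described above can be sketched for a single scenario. This is a minimal illustration, not the authors' setup: the paper used the ranger R package, so scikit-learn's `min_samples_leaf` serves only as an analogue of ranger's minimum node size, and the effect sizes and continuous predictors below are hypothetical choices for one logistic DGM.

```python
# One simulation scenario, sketched with scikit-learn (the paper used ranger in R;
# min_samples_leaf here is only an analogue of ranger's minimum node size).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def simulate(n, n_pred=4, beta=None):
    """Draw predictors and binary outcomes from a logistic model."""
    if beta is None:
        beta = np.full(n_pred, 0.5)          # hypothetical effect sizes
    X = rng.standard_normal((n, n_pred))
    p = 1 / (1 + np.exp(-(X @ beta)))        # true event probabilities
    y = rng.binomial(1, p)
    return X, y, p

X_train, y_train, _ = simulate(200)          # training size 200 (or 4000)
X_test, y_test, p_test = simulate(100_000)   # large test set, as in the study
true_auc = roc_auc_score(y_test, p_test)     # AUC of the true probabilities

for min_leaf in (2, 20):                     # analogue of minimum node size 2 / 20
    rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=min_leaf,
                                random_state=0).fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"min_leaf={min_leaf}: train AUC {train_auc:.3f}, "
          f"test AUC {test_auc:.3f}, true AUC {true_auc:.3f}")
```

In the full study, each such scenario was repeated over 1000 training datasets and the medians of these performance measures were summarized.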
The visualizations suggested that the model learned "spikes of probability" around events in the training set. A cluster of events created a larger peak or plateau (signal); isolated events created local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors, or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with more events per variable, a larger minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation -0.11). Median test slopes were higher with a higher true AUC, a larger minimum node size, and a larger sample size.
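The calibration slope reported above is conventionally estimated as the coefficient from a logistic regression of the observed outcomes on the logit of the predicted risks (a slope of 1 is ideal; slopes below 1 indicate risk estimates that are too extreme). A minimal sketch of this standard computation, assuming scikit-learn rather than the R tooling used in the paper:

```python
# Standard calibration slope: logistic regression of outcomes y on logit(p).
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope(y, p, eps=1e-8):
    """Coefficient of logit(p) in a logistic regression of y on logit(p)."""
    p = np.clip(p, eps, 1 - eps)             # guard against logit(0) / logit(1)
    logit_p = np.log(p / (1 - p)).reshape(-1, 1)
    # Very large C approximates unpenalised logistic regression.
    lr = LogisticRegression(C=1e6).fit(logit_p, y)
    return lr.coef_[0, 0]

# Perfectly calibrated risks should give a slope close to 1.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 50_000)
y = rng.binomial(1, p)
print(round(calibration_slope(y, p), 2))
```

With outcomes drawn exactly from the stated probabilities, as here, the estimated slope should be close to 1; overfitted models typically produce test slopes below 1.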
Random forests learn local probability peaks that often yield near-perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.