Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA.
BMC Med Inform Decis Mak. 2021 Nov 22;21(1):322. doi: 10.1186/s12911-021-01688-3.
BACKGROUND: Random forests are one of the most successful machine learning methods, but their performance needs to be optimized for datasets that come from a two-phase sampling design with a small number of cases. This is a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.
METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning.
RESULTS: Our experiments show that class balancing improves random forest prediction performance when variable screening is not applied, but has a negative impact on performance when variable screening is applied. The impact of weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests underperform generalized linear models for some subsets of markers. Prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and the extent of improvement depends critically on the dissimilarities between candidate learner predictions.
CONCLUSION: In small datasets from two-phase sampling designs, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests alone.
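The inverse sampling probability weighting mentioned in the conclusion can be illustrated with a small sketch. It is not the authors' code: the function names `inverse_probability_weights` and `weighted_auc` are hypothetical, the scores and sampling probabilities are made-up toy values, and a weighted AUC is used here only as one plausible way to evaluate predictions from a two-phase sample in which all cases but only a fraction of controls have markers measured.

```python
# Minimal sketch (assumed, not from the paper): inverse sampling probability
# weights for a two-phase sample, used to compute a weighted AUC.

def inverse_probability_weights(sampling_probs):
    """Return 1/p for each phase-two sampling probability p in (0, 1]."""
    if any(p <= 0 or p > 1 for p in sampling_probs):
        raise ValueError("sampling probabilities must lie in (0, 1]")
    return [1.0 / p for p in sampling_probs]


def weighted_auc(scores, labels, weights):
    """Weighted probability that a case scores higher than a control.

    Each case-control pair contributes the product of the two weights;
    tied scores count as half a concordant pair.
    """
    concordant = 0.0
    total = 0.0
    rows = list(zip(scores, labels, weights))
    for s_case, y_case, w_case in rows:
        if y_case != 1:
            continue
        for s_ctrl, y_ctrl, w_ctrl in rows:
            if y_ctrl != 0:
                continue
            pair_weight = w_case * w_ctrl
            total += pair_weight
            if s_case > s_ctrl:
                concordant += pair_weight
            elif s_case == s_ctrl:
                concordant += 0.5 * pair_weight
    if total == 0:
        raise ValueError("need at least one case and one control")
    return concordant / total


# Toy data: two cases (always sampled) and three controls sampled with
# different hypothetical probabilities.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]
sampling_probs = [1.0, 1.0, 0.5, 0.25, 0.25]

weights = inverse_probability_weights(sampling_probs)
auc_weighted = weighted_auc(scores, labels, weights)
auc_unweighted = weighted_auc(scores, labels, [1.0] * len(labels))
```

In this toy example the under-sampled controls receive larger weights, so the weighted and unweighted AUC values differ. The same weights could in principle be passed to a learner that accepts per-observation weights, although how the paper applies them is not specified in the abstract.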