Improving random forest predictions in small datasets from two-phase sampling designs.

Affiliation

Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA.

Publication information

BMC Med Inform Decis Mak. 2021 Nov 22;21(1):322. doi: 10.1186/s12911-021-01688-3.


DOI:10.1186/s12911-021-01688-3
PMID:34809631
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8607560/
Abstract

BACKGROUND: While random forests are one of the most successful machine learning methods, their performance needs to be optimized for datasets that result from a two-phase sampling design with a small number of cases. This is a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.

METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning.

RESULTS: Our experiments show that class balancing helps improve random forest prediction performance when variable screening is not applied, but has a negative impact on performance when variable screening is applied. The impact of weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests underperform generalized linear models for some subsets of markers. Prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and the extent of improvement depends critically on the dissimilarities between candidate learner predictions.

CONCLUSION: In small datasets from two-phase sampling designs, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests alone.
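To make the idea of inverse sampling probability weighting concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the stratum labels, and the cohort sizes are all hypothetical. It assumes a design in which every case and a random subset of controls from the phase-one cohort are sampled into phase two, so each phase-two subject receives a weight equal to its stratum's phase-one count divided by its phase-two count.

```python
from collections import Counter


def inverse_sampling_weights(phase1_counts, phase2_strata):
    """Return one weight per phase-two subject: N_stratum / n_stratum.

    phase1_counts: stratum -> number of subjects in the full phase-one cohort.
    phase2_strata: stratum label of each subject sampled into phase two.
    """
    phase2_counts = Counter(phase2_strata)
    return [phase1_counts[s] / phase2_counts[s] for s in phase2_strata]


def weighted_case_fraction(strata, weights):
    """Weighted share of cases, an estimate of the phase-one case fraction."""
    case_weight = sum(w for s, w in zip(strata, weights) if s == "case")
    return case_weight / sum(weights)


# Hypothetical cohort (not the trial's numbers): 20 cases and 1000 controls
# in phase one; markers measured on all 20 cases and 100 sampled controls.
phase1_counts = {"case": 20, "control": 1000}
phase2_strata = ["case"] * 20 + ["control"] * 100

weights = inverse_sampling_weights(phase1_counts, phase2_strata)
naive_fraction = phase2_strata.count("case") / len(phase2_strata)
weighted_fraction = weighted_case_fraction(phase2_strata, weights)

print(f"case weight: {weights[0]}, control weight: {weights[-1]}")
print(f"naive case fraction: {naive_fraction:.4f}")
print(f"weighted case fraction: {weighted_fraction:.4f}")
```

In this toy setup each case gets weight 1 and each sampled control gets weight 10, so the weighted case fraction matches the phase-one cohort (20 of 1020) rather than the enriched phase-two sample (20 of 120). Weights of this kind could then be supplied to a learner that accepts per-observation weights, for example through the `sample_weight` argument of scikit-learn's `RandomForestClassifier.fit`; the abstract does not say which software the authors used.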

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a39a/8607560/8af178c41feb/12911_2021_1688_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a39a/8607560/340e0f383cf6/12911_2021_1688_Fig2_HTML.jpg
