A random forest method with feature selection for developing medical prediction models with clustered and longitudinal data.

Affiliation

Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA.

Publication information

J Biomed Inform. 2021 May;117:103763. doi: 10.1016/j.jbi.2021.103763. Epub 2021 Mar 26.

DOI:10.1016/j.jbi.2021.103763

PMID:33781921

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8131242/

Abstract

BACKGROUND

Machine learning methods are gaining popularity for developing medical prediction models from datasets with a large number of predictors, particularly in the setting of clustered and longitudinal data. Binary Mixed Model (BiMM) forest is a promising machine learning algorithm that can be applied to develop prediction models for clustered and longitudinal binary outcomes. Although machine learning methods for clustered and longitudinal data, such as BiMM forest, exist, feature selection for these methods has not been analyzed via data simulations. Feature selection improves the practicality and ease of use of prediction models for clinicians by reducing the burden of data collection. Thus, feature selection procedures are not only beneficial but often necessary for the development of medical prediction models. In this study, we aim to assess feature selection within the BiMM forest setting for modeling clustered and longitudinal binary outcomes.

METHODS

We conducted a simulation study to compare BiMM forest with feature selection (backward elimination or stepwise selection) to standard generalized linear mixed model feature selection methods (shrinkage and backward elimination). As an example application of the proposed methodology, we also evaluated the feature selection methods for developing models that predict mobility disability in older adults using the Health, Aging and Body Composition Study dataset.
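
The abstract does not spell out the elimination procedure, so the following is only a minimal sketch of generic greedy backward elimination, not the authors' implementation. The names `backward_elimination`, `toy_score`, `SIGNAL`, and the `tolerance` parameter are hypothetical. In practice the scoring function would presumably be something like the cross-validated accuracy of a forest refit on each candidate subset, with folds split by cluster; here a toy score stands in so the example runs on its own.

```python
from typing import Callable, Sequence, Tuple

Subset = Tuple[str, ...]


def backward_elimination(
    features: Sequence[str],
    score: Callable[[Subset], float],
    tolerance: float = 0.0,
) -> Subset:
    """Greedy backward elimination over a set of feature names.

    Start from all features. At each step, drop the single feature whose
    removal gives the best score, and stop once every possible removal
    would lower the score by more than `tolerance`.
    """
    current: Subset = tuple(features)
    best = score(current)
    while len(current) > 1:
        # Build every subset obtained by dropping exactly one feature.
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        top_score, top_subset = max(
            ((score(c), c) for c in candidates), key=lambda pair: pair[0]
        )
        if top_score < best - tolerance:
            break  # every removal hurts, so keep the current set
        current, best = top_subset, top_score
    return current


# Hypothetical stand-in for a model-based score: signal features help,
# noise features cost a little. A real score would refit the model.
SIGNAL = {"x1", "x2"}


def toy_score(subset: Subset) -> float:
    n_signal = sum(1 for f in subset if f in SIGNAL)
    n_noise = len(subset) - n_signal
    return 0.5 + 0.1 * n_signal - 0.01 * n_noise


if __name__ == "__main__":
    print(backward_elimination(["x1", "x2", "noise1", "noise2"], toy_score))
```

Ties are resolved in favor of the smaller subset, which matches the stated motivation of reducing the data collection burden; a positive `tolerance` would trade a small loss in score for fewer predictors.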

RESULTS

BiMM forest with backward elimination generally offered higher computational efficiency, similar or higher predictive performance (accuracy and area under the receiver operating characteristic curve), and similar or higher ability to identify the correct features compared to linear methods across the simulated scenarios. For predicting mobility disability in older adults, the methods generally performed similarly in terms of accuracy, area under the receiver operating characteristic curve, and specificity; however, BiMM forest with backward elimination had the highest sensitivity.
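
For readers less familiar with the four metrics named above, the sketch below shows one common way to compute them from binary labels and predicted probabilities. It is an illustration only: the function name `classification_metrics`, the 0.5 threshold, and the example data are assumptions, not values from the study. The area under the curve is computed with the pairwise-ranking (Mann-Whitney) formulation, and the function assumes both classes are present.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and AUC for a binary outcome."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # AUC: probability that a randomly chosen positive case receives a
    # higher predicted probability than a randomly chosen negative case,
    # counting ties as one half.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "auc": wins / (len(pos) * len(neg)),
    }


if __name__ == "__main__":
    # Made-up labels and probabilities, for illustration only.
    labels = [1, 1, 1, 0, 0, 0]
    probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
    print(classification_metrics(labels, probs))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC does not, which is one reason methods can look similar on AUC yet differ in sensitivity.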

CONCLUSIONS

This study is novel because it is the first investigation of feature selection for developing random forest prediction models for clustered and longitudinal binary outcomes. Results from the simulation study reveal that BiMM forest with backward elimination has the highest accuracy (predictive performance and identification of correct features) and lowest computation time compared to other feature selection methods in some scenarios, and similar performance in others. Many informatics datasets have clustered and longitudinal outcomes, and results from this study suggest that BiMM forest with backward elimination may be beneficial for developing medical prediction models.
