Predicting disease risks from highly imbalanced data using random forest.

Affiliation

Department of Computer Science, University of Missouri, Columbia, Missouri, USA.

Publication information

BMC Med Inform Decis Mak. 2011 Jul 29;11:51. doi: 10.1186/1472-6947-11-51.

DOI:10.1186/1472-6947-11-51

PMID:21801360

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3163175/

Abstract

BACKGROUND

We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

METHODS

We used the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases.
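As a rough illustration of the sub-sampling step, the sketch below builds balanced sub-samples in plain Python. It assumes binary labels (1 = disease present, 0 = absent) and assumes that each sub-sample pairs every minority-class record with an equally sized random draw from the majority class; the abstract does not spell out the exact partitioning scheme. The function name `balanced_subsamples`, its parameters, and the toy label vector are hypothetical.

```python
import random


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return a list of index lists, one per sub-sample.

    Assumption (not stated in the abstract): every sub-sample keeps all
    minority-class indices and adds an equal number of majority-class
    indices drawn at random without replacement.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class is smaller is treated as the minority class.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)

    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples


# Toy, made-up labels: 5 positive records among 100 (a 5% prevalence).
labels = [1] * 5 + [0] * 95
subs = balanced_subsamples(labels, n_subsamples=10)

for s in subs[:3]:
    n_pos = sum(labels[i] for i in s)
    print(len(s), "records,", n_pos, "positive,", len(s) - n_pos, "negative")
```

One classifier would then be trained per sub-sample and the outputs combined, for example by averaging predicted probabilities; that combination rule is likewise an assumption here rather than something the abstract specifies.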

RESULTS

We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
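For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the example labels and scores are made up for illustration and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    case is scored higher; ties contribute 0.5.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative values only: 2 positive and 3 negative cases.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise form is quadratic in the number of cases, so it suits small examples; a rank-based computation would be the usual choice for data at the scale of a national inpatient sample.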

CONCLUSIONS

By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.

Similar articles

Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers.

Proteins. 2008 Jun;71(4):1930-9. doi: 10.1002/prot.21838.

Stroke Prediction with Machine Learning Methods among Older Chinese.

Int J Environ Res Public Health. 2020 Mar 12;17(6):1828. doi: 10.3390/ijerph17061828.

Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.

J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x.

Ensemble Feature Learning of Genomic Data Using Support Vector Machine.

PLoS One. 2016 Jun 15;11(6):e0157330. doi: 10.1371/journal.pone.0157330. eCollection 2016.

Predicting the sorption efficiency of heavy metal based on the biochar characteristics, metal sources, and environmental conditions using various novel hybrid machine learning models.

Chemosphere. 2021 Aug;276:130204. doi: 10.1016/j.chemosphere.2021.130204. Epub 2021 Mar 9.

Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis.

Comput Biol Med. 2011 May;41(5):265-71. doi: 10.1016/j.compbiomed.2011.03.001. Epub 2011 Mar 17.

Application of Artificial Intelligence for Preoperative Diagnostic and Prognostic Prediction in Epithelial Ovarian Cancer Based on Blood Biomarkers.

Clin Cancer Res. 2019 May 15;25(10):3006-3015. doi: 10.1158/1078-0432.CCR-18-3378. Epub 2019 Apr 11.

Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches.

Med Care. 2010 Jun;48(6 Suppl):S106-13. doi: 10.1097/MLR.0b013e3181de9e17.

Class-imbalanced classifiers for high-dimensional data.

Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.

Cited by

Predictive modeling of bruising in broiler chickens using machine learning algorithms.

Poult Sci. 2025 Aug 29;104(11):105756. doi: 10.1016/j.psj.2025.105756.

Machine learning models to predict postoperative incontinence after endoscopic enucleation of the prostate for benign prostatic hyperplasia: An EAU-Endourology study.

Prostate Cancer Prostatic Dis. 2025 Aug 19. doi: 10.1038/s41391-025-01015-1.

A review of machine learning applications in heart health.

Biomed Eng Online. 2025 Aug 11;24(1):99. doi: 10.1186/s12938-025-01430-4.

A multimodal dataset for precision oncology in head and neck cancer.

Nat Commun. 2025 Aug 4;16(1):7163. doi: 10.1038/s41467-025-62386-6.

Fine-Scale Risk Mapping for Dengue Vector Using Spatial Downscaling in Intra-Urban Areas of Guangzhou, China.

Insects. 2025 Jun 25;16(7):661. doi: 10.3390/insects16070661.

The Use of Machine Learning for Analyzing Real-World Data in Disease Prediction and Management: Systematic Review.

JMIR Med Inform. 2025 Jun 19;13:e68898. doi: 10.2196/68898.

Machine learning model for age related macular degeneration based on pesticides: the National Health and Nutrition Examination Survey 2007-2008.

Front Public Health. 2025 Apr 16;13:1561913. doi: 10.3389/fpubh.2025.1561913. eCollection 2025.

A new Tec family-based clinical model predicts survival in differentiated thyroid cancer patients via machine learning.

Thyroid Res. 2025 May 1;18(1):18. doi: 10.1186/s13044-025-00234-x.

Using Machine Learning to Predict Cognitive Decline in Older Adults From the Chinese Longitudinal Healthy Longevity Survey: Model Development and Validation Study.

JMIR Aging. 2025 Apr 30;8:e67437. doi: 10.2196/67437.

Development and validation of a logistic regression model for the diagnosis of colorectal cancer.

Sci Rep. 2025 Apr 28;15(1):14759. doi: 10.1038/s41598-025-98968-z.
