Predicting disease risks from highly imbalanced data using random forest.

Affiliation

Department of Computer Science, University of Missouri, Columbia, Missouri, USA.

Publication information

BMC Med Inform Decis Mak. 2011 Jul 29;11:51. doi: 10.1186/1472-6947-11-51.

Abstract

BACKGROUND

We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict an individual's disease risk from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication, and decision support systems in healthcare.

METHODS

We used the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Because the HCUP data are highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting, and RF classifiers in predicting the risk of eight chronic diseases.
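The sub-sampling step can be illustrated with a minimal sketch. It is one plausible reading of "repeated random sub-sampling", not the authors' implementation: the function names `balanced_subsamples` and `ensemble_score`, the 0/1 label coding, and the choice to keep every minority record in each sub-sample are all assumptions made for illustration. In a real pipeline, one classifier (for example a random forest) would be trained per sub-sample and the scores averaged.

```python
import random
from collections import Counter


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return index lists in which both classes appear equally often.

    Assumption: labels are 0 (majority) and 1 (minority); each sub-sample
    keeps every minority index and draws an equal number of majority
    indices without replacement.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority or len(minority) > len(majority):
        raise ValueError("expected a non-empty minority class labelled 1")
    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples


def ensemble_score(models, x):
    """Average the positive-class scores of the per-sub-sample models.

    Each model is assumed to be a callable returning a score in [0, 1].
    """
    return sum(model(x) for model in models) / len(models)


# Toy data (hypothetical): 5 positive records among 100.
labels = [1] * 5 + [0] * 95
subsamples = balanced_subsamples(labels, n_subsamples=10, seed=42)
class_counts = [Counter(labels[i] for i in idx) for idx in subsamples]
```

Every entry of `class_counts` holds five records of each class, so a classifier trained on any single sub-sample sees a balanced problem, while the ensemble as a whole still draws on many different majority-class records.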

RESULTS

We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
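For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 3 of the 4 positive/negative pairs
# are ordered correctly, so the AUC is 0.75.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

Because the measure depends only on how positives rank relative to negatives, it is not inflated by simply predicting the majority class, which makes it a common choice for imbalanced data. The pairwise loop is quadratic and is meant for clarity rather than for large datasets.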

CONCLUSIONS

By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/52a8/3163175/9b04504983ff/1472-6947-11-51-1.jpg
