Department of Computer Science, University of Missouri, Columbia, Missouri, USA.
BMC Med Inform Decis Mak. 2011 Jul 29;11:51. doi: 10.1186/1472-6947-11-51.
We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict individuals' disease risk from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication, and decision support systems in healthcare.
We used the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting, and RF classifiers in predicting the risk of eight chronic diseases.
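The sub-sampling step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes binary labels, assumes each sub-sample pairs all minority-class examples with an equally sized, disjoint random slice of the majority class, and drops leftover majority examples. The function name `balanced_subsamples` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def balanced_subsamples(labels, seed=0):
    """Return index lists, each holding every minority example plus an
    equally sized, disjoint random slice of the majority class.

    Assumption (not stated in the abstract): labels are binary and the
    minority class is reused in every sub-sample.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority_label = min(counts, key=counts.get)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    rng.shuffle(majority)

    k = len(minority)
    subsamples = []
    # Majority examples that cannot fill a complete slice are dropped.
    for start in range(0, len(majority) - k + 1, k):
        subsamples.append(minority + majority[start:start + k])
    return subsamples


# Toy data: 3 positive (minority) and 10 negative (majority) examples.
labels = [1] * 3 + [0] * 10
subs = balanced_subsamples(labels)
for s in subs:
    print(sorted(s), Counter(labels[i] for i in s))
```

One classifier (for example, a random forest) would then be trained on each index list. How the per-sub-sample classifiers are combined into a final prediction is not specified here, so that step is left out of the sketch.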
We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
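For readers unfamiliar with the evaluation metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The function name `roc_auc` and the toy scores are illustrative assumptions and are unrelated to the HCUP results.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive example is scored higher; ties contribute 0.5.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not taken from the study).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.8, 0.1]
auc = roc_auc(y_true, scores)
print(auc)
```

In this toy example, 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of examples; a rank-based formulation would be preferable on data of realistic size.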
By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.