Comparison of Basic and Ensemble Data Mining Methods in Predicting 5-Year Survival of Colorectal Cancer Patients.
Author information
Pourhoseingholi Mohamad Amin, Kheirian Sedigheh, Zali Mohammad Reza
Affiliations
Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Department of Health Informatics Technology and Management, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Publication information
Acta Inform Med. 2017 Dec;25(4):254-258. doi: 10.5455/aim.2017.25.254-258.
INTRODUCTION
Colorectal cancer (CRC) is one of the most common malignancies and a leading cause of cancer mortality worldwide. Given the importance of predicting the survival of CRC patients and the growing use of data mining methods, this study aims to compare the performance of models for predicting 5-year survival of CRC patients using a variety of basic and ensemble data mining methods.
METHODS
The CRC dataset from the Research Center for Gastroenterology and Liver Diseases at Shahid Beheshti University of Medical Sciences was used to build and compare basic and ensemble data mining models. Feature selection methods were used to select the predictor attributes for classification. The WEKA toolkit was used to create the models, and MedCalc software was used to compare them.
RESULTS
The results showed that the predictive performance of the developed models was high overall (all greater than 90%). In general, the ensemble models performed better than the basic classifiers, and the best result was achieved by the ensemble voting model in terms of area under the ROC curve (AUC = 0.96).
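As a rough illustration of how a voting ensemble can be scored by AUC, the sketch below averages the predicted probabilities of two base models and computes a rank-based AUC. The abstract does not state which combination rule was used for voting, so averaging probabilities is an assumption here, and the labels, scores, and the names `soft_vote`, `auc`, `model_a`, and `model_b` are made-up for illustration only; they are not the study's data or code.

```python
from statistics import mean


def soft_vote(prob_lists):
    """Average each patient's positive-class probability across base models."""
    return [mean(probs) for probs in zip(*prob_lists)]


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive case scores higher
    than a random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = survived 5 years, 0 = did not.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]  # hypothetical base model scores
model_b = [0.6, 0.8, 0.3, 0.2, 0.4, 0.1]  # hypothetical base model scores
ensemble = soft_vote([model_a, model_b])

print(auc(labels, model_a), auc(labels, model_b), auc(labels, ensemble))
```

In this toy case each base model misranks one positive/negative pair, while the averaged scores rank every pair correctly. That is the general mechanism by which voting can help; it is not evidence about the study's own results.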
CONCLUSION
A comparison of AUCs showed that the ensemble voting method significantly outperformed all other models except Random Forest (RF) and Bayesian Network (BN), whose 95% confidence intervals overlapped with that of the voting model. This result may indicate that these two methods, along with ensemble voting, have high predictive power for 5-year survival of CRC patients.
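The confidence-interval overlap check mentioned above can be sketched as follows. This uses the Hanley and McNeil (1982) approximation for the standard error of an AUC, which is an assumption: the abstract does not say which method was applied. Only the AUC of 0.96 comes from the abstract; the group sizes and the comparator AUCs are hypothetical values chosen for illustration.

```python
from math import sqrt


def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate 95% CI for an AUC using the Hanley-McNeil standard error."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    se = sqrt(variance)
    # Clip to [0, 1] because an AUC cannot fall outside that range.
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical group sizes; the abstract does not report them.
N_POS, N_NEG = 100, 100

voting = hanley_mcneil_ci(0.96, N_POS, N_NEG)      # AUC from the abstract
close_model = hanley_mcneil_ci(0.94, N_POS, N_NEG)  # hypothetical comparator
far_model = hanley_mcneil_ci(0.85, N_POS, N_NEG)    # hypothetical comparator

print(intervals_overlap(voting, close_model))
print(intervals_overlap(voting, far_model))
```

Overlapping intervals are a conservative heuristic rather than a formal test: two AUCs estimated on the same patients are correlated, so a paired comparison can detect differences that an overlap check misses.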