Lessons learned from data mining of WHO mortality database.

Author information

Paoin W

Affiliation

Faculty of Medicine, Thammasat University, Rangsit Campus, Paholyotin Road, Pathumthani 12120, Thailand.

Publication information

Methods Inf Med. 2011;50(4):380-5. doi: 10.3414/ME10-02-0019. Epub 2011 Jun 21.

DOI:10.3414/ME10-02-0019

PMID:21691674

Abstract

OBJECTIVES

The objectives of this research were to test the ability of classification algorithms to predict the cause of death in mortality data with unknown causes, to find associations among common causes of death, to identify groups of countries based on their common causes of death, and to extract knowledge gained from data mining of the World Health Organization mortality database.

METHODS

WEKA version 3.5.3 was used for classification, clustering and association analysis of the World Health Organization mortality database, which contained 1,109,537 records. Three major steps were performed: Step 1 - preprocessing of data to convert all records into formats suitable for each type of analysis algorithm; Step 2 - analyzing data using the C4.5 decision tree and Naïve Bayes classification algorithms, the K-means clustering algorithm and the Apriori association analysis algorithm; Step 3 - interpretation of results and hypothesis testing after clustering analysis.
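
The abstract does not detail the hypothesis testing in Step 3 beyond naming a chi-square test in the Results. As a minimal sketch, assuming a Pearson chi-square test of independence on a cluster-by-cause contingency table, the statistic could be computed as follows. The function name and the counts are hypothetical illustrations, not values from the study.

```python
def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a 2-D table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts (NOT from the paper):
# rows = clusters, columns = predominant cause-of-death groups
hypothetical_counts = [
    [30, 10],
    [10, 30],
]

stat, dof = chi_square_statistic(hypothetical_counts)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

A large statistic relative to the chi-square distribution with the given degrees of freedom would suggest that the distribution of causes differs among clusters.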

RESULTS

Using a C4.5 decision tree classifier to predict cause of death, we obtained 440 leaf nodes that correctly classified death instances with an accuracy of 40.06%. The Naïve Bayes classification algorithm, which calculated the probability of death from each disease, correctly classified death instances with an accuracy of 28.13%. K-means clustering divided the data into four clusters containing 189, 59, 65 and 144 country-years. A chi-square test was used to test for differences in disease among the clusters, each of which had different diseases as predominant causes of death. Apriori association analysis produced association rules linking cancer of the lung, hypertension and cerebrovascular diseases. These were found among the top five leading causes of death, with confidence of 99-100%.
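
To make the 99-100% figure concrete: in association analysis, the confidence of a rule A -> B is the fraction of records containing A that also contain B. The sketch below computes support and confidence directly. The function names and the tiny record set are hypothetical and are not the study's data or WEKA's implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    joint_support = support(transactions, set(antecedent) | set(consequent))
    return joint_support / antecedent_support


# Hypothetical records (NOT from the paper): each set lists causes of death
# appearing among the leading causes for one country-year.
records = [
    {"lung_cancer", "hypertension", "cerebrovascular"},
    {"lung_cancer", "hypertension", "cerebrovascular"},
    {"lung_cancer", "cerebrovascular"},
    {"hypertension"},
]

rule_a = confidence(records, {"lung_cancer"}, {"cerebrovascular"})
rule_b = confidence(records, {"hypertension"}, {"cerebrovascular"})
print(f"lung_cancer -> cerebrovascular: {rule_a:.2f}")
print(f"hypertension -> cerebrovascular: {rule_b:.2f}")
```

Apriori adds an efficient search over candidate itemsets on top of these two measures; the sketch only evaluates rules that are supplied by hand.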

CONCLUSION

Classification tools produced the poorest results in predicting cause of death. Given the inadequacy of variables in the WHO database, creating a classification model to predict specific cause of death was not possible. Clustering and association tools yielded interesting results that could be used to identify new areas of interest in mortality data analysis. These findings can be used in data mining analysis to help solve some quality problems in mortality data.
