Paoin W
Faculty of Medicine, Thammasat University, Rangsit Campus, Paholyotin Road, Pathumthani 12120, Thailand.
Methods Inf Med. 2011;50(4):380-5. doi: 10.3414/ME10-02-0019. Epub 2011 Jun 21.
The objectives of this research were to test the ability of classification algorithms to predict the cause of death in mortality data with unknown causes, to find associations among common causes of death, to identify groups of countries based on their common causes of death, and to extract knowledge through data mining of the World Health Organization mortality database.
WEKA version 3.5.3 was used for classification, clustering and association analysis of the World Health Organization mortality database, which contained 1,109,537 records. Three major steps were performed: Step 1 - preprocessing of the data to convert all records into a format suitable for each type of analysis algorithm; Step 2 - analysis of the data using the C4.5 decision tree and Naïve Bayes classification algorithms, the K-means clustering algorithm and the Apriori association analysis algorithm; Step 3 - interpretation of the results and hypothesis testing after the clustering analysis.
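The clustering part of Step 2 can be illustrated with a minimal sketch of Lloyd's K-means iteration. This is not WEKA's implementation: the function names, the squared Euclidean distance, the fixed initial centroids and the toy country-year vectors (each giving a hypothetical share of deaths from two causes) are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point],
            initial_centroids: List[Point],
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical country-year vectors: (share of cause A, share of cause B).
toy_points = [(0.50, 0.10), (0.48, 0.12), (0.10, 0.55), (0.12, 0.50)]
labels, centroids = k_means(toy_points, [toy_points[0], toy_points[2]])
print(labels)
```

In this toy run the first two country-years fall into one cluster and the last two into another; real input would have one dimension per cause of death and, as in the study, four clusters.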
Using a C4.5 decision tree classifier to predict cause of death, we obtained a tree with 440 leaf nodes that classified death instances with an accuracy of 40.06%. The Naïve Bayes classification algorithm, which calculates the probability of death from each disease, classified death instances with an accuracy of 28.13%. K-means clustering divided the data into four clusters containing 189, 59, 65 and 144 country-years, respectively. A chi-square test was used to test the differences in disease among the clusters, each of which had different diseases as predominant causes of death. Apriori association analysis produced association rules linking cancer of the lung, hypertension and cerebrovascular diseases. These were found among the top five leading causes of death, with confidence levels of 99-100%.
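For readers unfamiliar with the confidence measure reported for the Apriori rules, the following minimal sketch shows how support and confidence are computed for a single rule. The transactions are hypothetical toy data (each set stands for the leading causes of death in one country-year), the function names are illustrative, and no part of this reproduces the study's data or WEKA's Apriori implementation.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]],
            itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Confidence of antecedent -> consequent:
    count(antecedent and consequent) / count(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        raise ValueError("antecedent never occurs in the transactions")
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a


# Hypothetical toy transactions, NOT taken from the WHO database.
toy = [
    frozenset({"lung_cancer", "hypertension", "cerebrovascular"}),
    frozenset({"lung_cancer", "hypertension", "cerebrovascular"}),
    frozenset({"hypertension", "cerebrovascular"}),
    frozenset({"lung_cancer", "cerebrovascular"}),
    frozenset({"diabetes"}),
]

rule_conf = confidence(toy, {"lung_cancer", "hypertension"},
                       {"cerebrovascular"})
rule_supp = support(toy, {"lung_cancer", "hypertension", "cerebrovascular"})
print(rule_conf, rule_supp)
```

A confidence close to 100% means that nearly every transaction containing the antecedent also contains the consequent; it says nothing on its own about how often the antecedent occurs, which is what support measures.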
Classification tools produced the poorest results in predicting cause of death. Given the inadequacy of the variables in the WHO database, creating a classification model to predict a specific cause of death was not possible. Clustering and association tools yielded interesting results that could be used to identify new areas of interest in mortality data analysis. These results can be used in data mining analysis to help solve some quality problems in mortality data.