Department of Informatics and Telematics, Harokopio University of Athens, Greece.
National School of Public Health, Athens, Greece.
Comput Methods Programs Biomed. 2017 Jul;145:73-83. doi: 10.1016/j.cmpb.2017.04.011. Epub 2017 Apr 13.
Micro- or macro-level mapping of cancer statistics is a challenging task that requires long-term planning, prospective studies and continuous monitoring of all cancer cases. The objective of the current study is to present how cancer registry data can be processed using data mining techniques in order to improve the outcomes of statistical analysis.
Data were collected from the Cancer Registry of Crete in Greece (counties of Rethymno and Lasithi) for the period 1998-2004. Data were recorded on paper forms and manually transcribed into a single data file, which introduced errors and noise (e.g. missing and erroneous values, duplicate entries). Data were pre-processed and prepared for analysis using data mining tools and algorithms. Feature selection was applied to evaluate the contribution of each collected feature to predicting patients' survival. Several classifiers were trained and evaluated for their ability to predict patient survival. Finally, statistical analysis of cancer morbidity and mortality rates in the two regions was performed in order to validate the initial findings.
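The cleaning step described above (dropping duplicate entries and records with missing values) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `clean_records`, the field names and the sample records are hypothetical, and the study does not state which tools or rules were actually used.

```python
def clean_records(records, key_fields, required_fields):
    """Illustrative cleaning pass (hypothetical, not the study's procedure).

    - strips whitespace from string values and treats empty strings as missing
    - drops records that lack any of the required fields
    - drops later records whose key fields repeat an earlier record
      (compared case-insensitively)
    """
    seen = set()
    cleaned = []
    for rec in records:
        norm = {}
        for field, value in rec.items():
            if isinstance(value, str):
                value = value.strip()
                if value == "":
                    value = None
            norm[field] = value

        # Skip records with a missing required value.
        if any(norm.get(f) is None for f in required_fields):
            continue

        # Skip duplicates of an already accepted record.
        key = tuple(str(norm.get(f)).lower() for f in key_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned


# Hypothetical transcribed rows: one duplicate, one with a missing value.
sample = [
    {"id": "1", "sex": "M", "age": "60"},
    {"id": "1 ", "sex": "m", "age": "60"},
    {"id": "2", "sex": "", "age": "70"},
]
cleaned = clean_records(sample, ["id", "sex", "age"], ["sex", "age"])
```

In practice, erroneous values would also need domain-specific checks (for example valid age ranges or diagnosis codes), which this sketch leaves out.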
Several critical points in the process of data collection, preprocessing and analysis of cancer data were derived from the results, and a road-map for future population data studies was developed. In addition, increased morbidity rates were observed in the counties of Crete (Age Standardized Morbidity/Incidence Rates, ASIR = 396.45 ± 2.89 and 274.77 ± 2.48 for men and women, respectively) compared to European and world averages (ASIR = 281.6 and 207.3 for men and women in Europe, and 203.8 and 165.1 worldwide). Significant variation in cancer types between sexes and age groups was also observed: the ratio of deaths to reported cases is 0.055 for young patients (less than 34 years old), whereas the respective ratio for patients over 75 years old is 0.366.
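For readers unfamiliar with the ASIR figures above, the following sketch shows the usual direct standardization calculation: age-specific rates are weighted by a standard population and expressed per 100,000. The abstract does not state which standard population or age bands were used, so the function names and all numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def age_standardized_rate(cases, person_years, std_weights, per=100_000):
    """Directly standardized rate: weighted mean of age-specific rates.

    cases, person_years and std_weights are parallel lists, one entry
    per age group. The weights come from a standard population.
    """
    if not (len(cases) == len(person_years) == len(std_weights)):
        raise ValueError("all inputs must have one entry per age group")
    total_weight = sum(std_weights)
    weighted = sum(
        w * (c / py) for c, py, w in zip(cases, person_years, std_weights)
    )
    return weighted / total_weight * per


def death_to_case_ratio(deaths, reported_cases):
    """Ratio of deaths to reported cases within one age group."""
    return deaths / reported_cases


# Hypothetical two-group example (not registry data):
# age-specific rates are 0.001 and 0.004, weights are 3 and 1.
asir = age_standardized_rate(
    cases=[10, 40],
    person_years=[10_000, 10_000],
    std_weights=[3, 1],
)
ratio = death_to_case_ratio(deaths=11, reported_cases=200)
```

With these made-up inputs the weighted rate is (3 × 0.001 + 1 × 0.004) / 4 = 0.00175, that is 175 per 100,000, and the death-to-case ratio is 0.055.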
This study introduced a methodology for preprocessing and analyzing cancer data using a combination of data mining techniques, which could be a useful tool for other researchers and for the further enhancement of cancer registries.