Pérez Joaquín, Iturbide Emmanuel, Olivares Víctor, Hidalgo Miguel, Martínez Alicia, Almanza Nelva
Tecnológico Nacional de México / CENIDET, Interior Internado Palmira s/n, Palmira, 62490, Cuernavaca, Morelos, Mexico.
Universidad Politécnica de Madrid, ETSII, Boadilla del Monte, Madrid, Spain.
J Med Syst. 2015 Nov;39(11):152. doi: 10.1007/s10916-015-0312-5. Epub 2015 Sep 18.
Data preparation is widely regarded as the most time-consuming phase of the data mining process, taking 50% or even up to 70% of total project time. Current data mining methodologies are general-purpose, and one of their limitations is that they give no guidance on which particular tasks to carry out in a specific domain. This paper presents a new data preparation methodology oriented to the epidemiological domain, in which we have identified two sets of tasks: General Data Preparation and Specific Data Preparation. For both sets, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is adopted as a guideline. The main contribution of our methodology is a set of fourteen specialized tasks for this domain. To validate the proposed methodology, we developed a data mining system and applied the entire process to real mortality databases. The results were encouraging: use of the methodology reduced some of the time-consuming tasks, and the data mining system revealed previously unknown and potentially useful patterns for the public health services in Mexico.