Department of Computer Languages and Systems, University of Cadiz, Cadiz, Spain.
Neural Netw. 2011 Jan;24(1):121-9. doi: 10.1016/j.neunet.2010.09.008. Epub 2010 Sep 17.
Data mining relies on data files that usually contain errors in the form of missing values. This paper presents a methodological framework for developing an automated data imputation model based on artificial neural networks. Fifteen real and simulated data sets, ranging in size from 47 to 1389 records, are subjected to a perturbation experiment in which missing values are generated at random, with the probability of a missing value set to 0.05. Several architectures and learning algorithms for the multilayer perceptron are tested and compared with three classic imputation procedures: mean/mode imputation, regression and hot-deck. Across the different performance measures considered, the results suggest that this approach improves the quality of a database with missing values, and that the best results are clearly obtained with the multilayer perceptron model in data sets with categorical variables. Three learning rules (Levenberg-Marquardt, BFGS Quasi-Newton and Conjugate Gradient Fletcher-Reeves Update) and a small number of hidden nodes are recommended.
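The perturbation experiment and the mean/mode baseline described in the abstract can be illustrated with a minimal sketch. It assumes that each cell is deleted independently with probability 0.05 and that imputation error on a numeric column is measured by RMSE; the abstract does not specify either detail. The function names (`perturb`, `mean_mode_impute`, `numeric_rmse`) and the toy data are hypothetical, and the paper's neural network model is not reproduced here.

```python
import random
from collections import Counter

def perturb(rows, p_missing=0.05, seed=0):
    """Copy rows, replacing each cell by None with probability p_missing (assumed independent per cell)."""
    rng = random.Random(seed)
    return [[None if rng.random() < p_missing else v for v in row] for row in rows]

def mean_mode_impute(rows):
    """Fill None cells with the column mean (numeric columns) or column mode (categorical columns)."""
    fills = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            fills.append(None)  # nothing observed in this column, so nothing to impute from
        elif all(isinstance(v, (int, float)) for v in observed):
            fills.append(sum(observed) / len(observed))
        else:
            fills.append(Counter(observed).most_common(1)[0][0])
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def numeric_rmse(original, perturbed, imputed, col):
    """RMSE between true and imputed values, restricted to cells that perturb() removed in column col."""
    errs = [(o[col] - i[col]) ** 2
            for o, p, i in zip(original, perturbed, imputed) if p[col] is None]
    return (sum(errs) / len(errs)) ** 0.5 if errs else 0.0

# Hypothetical toy data set: one numeric and one categorical variable, 200 records.
toy_rng = random.Random(1)
data = [[toy_rng.gauss(50.0, 10.0), toy_rng.choice(["a", "b", "c"])] for _ in range(200)]

damaged = perturb(data, p_missing=0.05, seed=42)
repaired = mean_mode_impute(damaged)
print("RMSE on deleted numeric cells:", numeric_rmse(data, damaged, repaired, col=0))
```

Any other imputation procedure, such as regression, hot-deck or a multilayer perceptron, could be substituted for `mean_mode_impute` and scored on the same deleted cells, which is the kind of comparison the abstract describes.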