Chen Chih-Wen, Lin Wei-Chao, Ke Shih-Wen, Tsai Chih-Fong, Hu Ya-Han
Department of Pharmacy, Kaohsiung Municipal Chinese Medical Hospital, Taiwan.
Department of Computer Science and Information Engineering, Hwa Hsia University of Technology, Taiwan.
Technol Health Care. 2015;23(5):619-25. doi: 10.3233/THC-151018.
When medical datasets are collected, a number of data samples usually contain missing values. Performing data mining over such incomplete datasets is difficult. A common remedy is missing value imputation, which estimates the missing values by reasoning from the observed data. Consequently, the effectiveness of missing value imputation depends heavily on the observed (complete) data in the incomplete dataset.
The objective of this paper is to apply instance selection to filter out noisy data (or outliers) from a given (complete) dataset and to examine the effect on the final imputation result. Specifically, four different processes that combine instance selection and missing value imputation are proposed and compared in terms of data classification.
Experiments are conducted on 11 medical datasets containing categorical, numerical, and mixed attribute types. Missing values are introduced into all attributes of each dataset at missing data rates of 10%, 20%, 30%, 40%, and 50%. DROP3 is employed for instance selection and k-nearest neighbor imputation for missing value imputation. The support vector machine (SVM) classifier is then used to assess the final classification accuracy of the four different processes.
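The experimental setup above (injecting missing values at a fixed rate, then filling them by k-nearest neighbor imputation) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `inject_missing` and `knn_impute` are hypothetical, the missingness is assumed to be completely at random, only numerical attributes are handled, and donors are assumed to be the fully observed rows.

```python
import math
import random


def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with round(rate * n_cells) cells set to None.

    Assumption: cells are removed completely at random over all attributes.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = [list(r) for r in rows]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        masked[i][j] = None
    return masked


def knn_impute(rows, k=3):
    """Fill each None with the mean of that attribute over the k nearest
    complete rows. Distance is Euclidean over the attributes that are
    observed in the incomplete row (numerical attributes only)."""
    donors = [r for r in rows if None not in r]
    if not donors:
        raise ValueError("no complete rows available as donors")
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(donor):
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        nearest = sorted(donors, key=dist)[:k]
        imputed.append([
            v if v is not None else sum(d[j] for d in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed
```

Calling `inject_missing(rows, 0.3)` followed by `knn_impute(...)` mimics one missing-rate setting. Categorical attributes would need a different distance and a mode-based fill, which this sketch omits.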
The experimental results show that the second process, which performs instance selection first and imputation second, allows the SVM classifier to outperform the other processes.
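The "instance selection first, imputation second" ordering can be sketched as follows. This is one plausible reading of the process, not the paper's procedure. DROP3 is replaced here by a simpler ENN-style noise filter (the kind of noise-filtering pass that DROP3 applies as its first step), and the names `filter_noise` and `select_then_impute` are hypothetical. Only numerical attributes are handled.

```python
import math
from collections import Counter


def distance(a, b, cols):
    """Euclidean distance restricted to the attribute indices in cols."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in cols))


def filter_noise(X, y, k=3):
    """ENN-style stand-in for DROP3: keep an instance only if the majority
    label among its k nearest neighbours (excluding itself) matches its own."""
    cols = range(len(X[0]))
    kept = []
    for i in range(len(X)):
        others = [m for m in range(len(X)) if m != i]
        nearest = sorted(others, key=lambda m: distance(X[i], X[m], cols))[:k]
        majority = Counter(y[m] for m in nearest).most_common(1)[0][0]
        if majority == y[i]:
            kept.append(i)
    return kept


def select_then_impute(X, y, k=3):
    """Filter the complete rows first, then impute the incomplete rows
    using only the retained rows as donors."""
    complete = [i for i, row in enumerate(X) if None not in row]
    incomplete = [i for i, row in enumerate(X) if None in row]
    kept = filter_noise([X[i] for i in complete], [y[i] for i in complete], k)
    donors = [X[complete[i]] for i in kept]
    if not donors:
        raise ValueError("no donors left after instance selection")
    X_out = [list(d) for d in donors]
    y_out = [y[complete[i]] for i in kept]
    for i in incomplete:
        row = X[i]
        observed = [j for j, v in enumerate(row) if v is not None]
        nearest = sorted(donors, key=lambda d: distance(row, d, observed))[:k]
        X_out.append([
            v if v is not None else sum(d[j] for d in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
        y_out.append(y[i])
    return X_out, y_out
```

The design point is that a mislabeled or outlying complete row is discarded before it can serve as a donor, so it cannot distort the imputed values. The returned set could then be passed to any classifier, such as an SVM.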
Incomplete medical datasets that contain missing values require missing value imputation. This paper demonstrates that instance selection can be used to filter out noisy data or outliers before the imputation process. In other words, the observed data used for missing value imputation may contain noisy information, which can degrade the quality of the imputation result as well as the classification performance.