On mining incomplete medical datasets: Ordering imputation and classification.

Author information

Chen Chih-Wen, Lin Wei-Chao, Ke Shih-Wen, Tsai Chih-Fong, Hu Ya-Han

Affiliations

Department of Pharmacy, Kaohsiung Municipal Chinese Medical Hospital, Taiwan.

Department of Computer Science and Information Engineering, Hwa Hsia University of Technology, Taiwan.

Publication information

Technol Health Care. 2015;23(5):619-25. doi: 10.3233/THC-151018.

DOI:10.3233/THC-151018

PMID:26410122

Abstract

BACKGROUND

When medical datasets are collected, a number of data samples usually contain missing values. Performing data mining over such incomplete datasets is difficult. In general, the problem can be approached with missing value imputation, which estimates the missing values by reasoning from the observed data. Consequently, the effectiveness of missing value imputation depends heavily on the observed data (or complete data) in the incomplete datasets.

OBJECTIVE

The objective of this paper is to perform instance selection to filter out noisy data (or outliers) from a given (complete) dataset and to examine its effect on the final imputation result. Specifically, four different processes that combine instance selection and missing value imputation are proposed and compared in terms of data classification.

METHODS

Experiments are conducted on 11 medical-related datasets containing categorical, numerical, and mixed attribute types. Missing values are introduced into all attributes of each dataset at missing data rates of 10%, 20%, 30%, 40%, and 50%. DROP3 is employed for instance selection and k-nearest neighbor imputation for missing value imputation. The support vector machine (SVM) classifier is then used to assess the final classification accuracy of the four processes.
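The missing-value setup and the k-nearest neighbor imputation step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`inject_missing`, `knn_impute`), the choice to mask cells uniformly at random, the Euclidean distance over observed attributes, and the use of the neighbor mean are all assumptions made for illustration, and only numerical attributes are handled. The DROP3 and SVM steps are omitted here.

```python
import math
import random


def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with about `rate` of all cells replaced by None."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    masked = set(rng.sample(cells, round(rate * len(cells))))
    return [
        [None if (i, j) in masked else value for j, value in enumerate(row)]
        for i, row in enumerate(rows)
    ]


def distance(a, b):
    """Euclidean distance over the attributes observed in `a`; `b` is complete."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def knn_impute(rows, k=3):
    """Fill each None with the attribute mean over the k nearest complete rows."""
    donors = [row for row in rows if None not in row]
    if not donors:
        raise ValueError("kNN imputation needs at least one complete row")
    imputed = []
    for row in rows:
        nearest = sorted(donors, key=lambda d: distance(row, d))[:k]
        imputed.append([
            sum(d[j] for d in nearest) / len(nearest) if value is None else value
            for j, value in enumerate(row)
        ])
    return imputed


# Toy numerical data (hypothetical, not one of the 11 datasets).
data = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, 6.0], [5.1, 6.1], [4.9, 5.9]]
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    incomplete = inject_missing(data, rate, seed=42)
    n_missing = sum(value is None for row in incomplete for value in row)
    print(f"rate={rate:.0%}: {n_missing} of {len(data) * len(data[0])} cells masked")

# Hand-made incomplete example so the imputed value is easy to check by eye.
example = [[1.0, 2.0], [1.2, 2.2], [5.0, 6.0], [1.1, None]]
print(knn_impute(example, k=2))
```

Because the donors are the complete rows only, the sketch makes the dependence described in the background concrete: the quality of each estimate is bounded by the quality of the complete rows it is drawn from.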

RESULTS

The experimental results show that the second process, which performs instance selection first and imputation second, allows the SVM classifiers to outperform those of the other processes.

CONCLUSIONS

For incomplete medical datasets, it is necessary to perform missing value imputation. In this paper, we demonstrate that instance selection can be used to filter out noisy data or outliers before the imputation process. In other words, the observed data used for missing value imputation may contain noisy information, which can degrade the quality of the imputation result as well as the classification performance.
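The idea of filtering noisy observed data before imputation can be sketched as follows. This is a toy illustration under stated assumptions, not the method evaluated in the paper: a simple edited-nearest-neighbor style filter stands in for DROP3, the helper names (`filter_noisy`, `knn_impute`) and the data are hypothetical, and filtered rows are only excluded from the donor pool rather than deleted from the dataset.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance over the attributes observed in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


def filter_noisy(rows, labels, k=3):
    """ENN-style stand-in for DROP3: keep the index of a row only if the
    majority label among its k nearest other rows matches its own label."""
    kept = []
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = sorted(others, key=lambda j: euclidean(row, rows[j]))[:k]
        majority = Counter(labels[j] for j in nearest).most_common(1)[0][0]
        if majority == labels[i]:
            kept.append(i)
    return kept


def knn_impute(rows, donors, k=3):
    """Fill each None with the attribute mean over the k nearest donor rows."""
    if not donors:
        raise ValueError("kNN imputation needs at least one donor row")
    result = []
    for row in rows:
        nearest = sorted(donors, key=lambda d: euclidean(row, d))[:k]
        result.append([
            sum(d[j] for d in nearest) / len(nearest) if value is None else value
            for j, value in enumerate(row)
        ])
    return result


# Two clusters, one outlier in class "a", and one incomplete row (all made up).
rows = [
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # class "a"
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],   # class "b"
    [1.05, 9.0],                          # outlier labelled "a"
    [1.0, None],                          # incomplete row to impute
]
labels = ["a", "a", "a", "b", "b", "b", "a", "a"]

complete = [i for i, row in enumerate(rows) if None not in row]
all_donors = [rows[i] for i in complete]
kept = filter_noisy(all_donors, [labels[i] for i in complete], k=3)
clean_donors = [all_donors[i] for i in kept]

print("imputation only:       ", knn_impute(rows, all_donors, k=3)[-1])
print("selection, then impute:", knn_impute(rows, clean_donors, k=3)[-1])
```

On this toy data the imputed value is about 3.7 when the outlier stays in the donor pool and about 1.0 once it has been filtered out, which shows how a single noisy observed row can distort an estimate. The numbers describe the sketch only and say nothing about the magnitude of the effect reported in the paper.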
