A classifier ensemble approach for the missing feature problem.

Affiliation

Department of Information Engineering, University of Padua, Via Gradenigo, 6/B, 35131 Padova, Italy.

Publication information

Artif Intell Med. 2012 May;55(1):37-50. doi: 10.1016/j.artmed.2011.11.006. Epub 2011 Dec 20.

Abstract

OBJECTIVES

Many classification problems must deal with data that contain missing values. In such cases, data imputation is critical. This paper evaluates the performance of several statistical and machine learning imputation methods, including our novel multiple imputation ensemble approach, on different datasets.

MATERIALS AND METHODS

Several state-of-the-art approaches are compared using different datasets. Some state-of-the-art classifiers (including support vector machines and input decimated ensembles) are tested with several imputation methods. The novel approach proposed in this work is a multiple imputation method based on random subspace, in which each missing value is calculated considering a different cluster of the data. Clustering is performed with a fuzzy clustering approach.
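The idea of imputing each missing value from a different clustering of the data, once per random subspace, can be sketched as follows. This is a minimal illustration and not the authors' MATLAB implementation: the function names, the use of plain fuzzy c-means fitted on complete rows, and the rule of filling a gap with the membership-weighted average of the cluster centres are all assumptions made for the example.

```python
import random

def memberships(row, centres, feats, m=2.0):
    """Fuzzy memberships of one row, using only its observed features."""
    dists = []
    for c in centres:
        s = sum((row[j] - c[j]) ** 2 for j in feats if row[j] is not None)
        dists.append(max(s, 1e-12))  # avoid division by zero
    # Fuzzy c-means rule: u_c is proportional to (squared distance)^(-1/(m-1)).
    w = [d ** (-1.0 / (m - 1.0)) for d in dists]
    total = sum(w)
    return [x / total for x in w]

def fuzzy_cmeans(rows, feats, k, m=2.0, iters=30, rng=random):
    """Fit k centres on complete rows, restricted to the features in feats.

    Assumes len(rows) >= k (hypothetical simplification)."""
    centres = [{j: r[j] for j in feats} for r in rng.sample(rows, k)]
    for _ in range(iters):
        u = [memberships(r, centres, feats, m) for r in rows]
        for c in range(k):
            w = [u[i][c] ** m for i in range(len(rows))]
            total = sum(w)
            for j in feats:
                centres[c][j] = sum(wi * r[j] for wi, r in zip(w, rows)) / total
    return centres

def impute_in_subspace(data, feats, k=2, rng=random):
    """Return data restricted to feats, with missing values (None) filled."""
    complete = [r for r in data if all(r[j] is not None for j in feats)]
    centres = fuzzy_cmeans(complete, feats, k, rng=rng)
    filled = []
    for r in data:
        u = memberships(r, centres, feats)
        filled.append([
            r[j] if r[j] is not None
            else sum(uc * c[j] for uc, c in zip(u, centres))
            for j in feats
        ])
    return filled

def multiple_imputation(data, n_members=5, subspace_size=2, seed=0):
    """One imputed copy of the data per random feature subspace."""
    rng = random.Random(seed)
    n_feats = len(data[0])
    members = []
    for _ in range(n_members):
        feats = sorted(rng.sample(range(n_feats), subspace_size))
        members.append((feats, impute_in_subspace(data, feats, rng=rng)))
    return members
```

Each `(feats, filled)` pair would then train one member of the ensemble, so a given missing value can receive a different estimate in every subspace in which its feature appears.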

RESULTS

Our experiments show that the proposed multiple imputation approach, based on clustering and a random subspace classifier, outperforms several other state-of-the-art approaches. Using the Wilcoxon signed-rank test (null hypothesis rejected at a significance level of 0.05), we show that the best proposed approach is outperformed by the classifier trained on the original data (i.e., without missing values) only when >20% of the data are missing. Moreover, we show that coupling an imputation method with our cluster-based imputation outperforms the base method (significance level ∼0.05).
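A paired comparison of this kind can be illustrated with a small exact Wilcoxon signed-rank test. The sketch below is a generic standard-library implementation, not the authors' evaluation code. The function name is hypothetical, zero differences are dropped, and tied absolute differences receive average ranks.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Ranks are doubled internally so that average ranks stay integers.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # twice the rank of each difference
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks2[order[t]] = i + j + 2  # 2 * average of ranks i+1 .. j+1
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    w_minus2 = sum(ranks2) - w_plus2
    stat2 = min(w_plus2, w_minus2)

    # Null distribution of W+: every sign pattern is equally likely.
    counts = {0: 1}
    for r in ranks2:
        nxt = dict(counts)  # this rank takes a negative sign
        for s, c in counts.items():
            nxt[s + r] = nxt.get(s + r, 0) + c  # this rank takes a positive sign
        counts = nxt

    tail = sum(c for s, c in counts.items() if s <= stat2)
    p_value = min(1.0, 2.0 * tail / 2 ** n)
    return stat2 / 2.0, p_value
```

For example, if one method has the higher accuracy on all of eight datasets, the statistic is 0 and the exact two-sided p-value is 2/2^8 ≈ 0.0078, so the null hypothesis of equal performance would be rejected at the 0.05 level.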

CONCLUSION

Starting from the assumptions that the feature set must be partially redundant and that the redundancy is distributed randomly over the feature set, we have proposed a method that works quite well even when a large percentage of the features is missing (≥30%). MATLAB code for our best approach is available at bias.csr.unibo.it/nanni/MI.rar.
