A survey on missing data in machine learning.

Author information

Emmanuel Tlamelo, Maupong Thabiso, Mpoeleng Dimane, Semong Thabo, Mphago Banyatsang, Tabona Oteng

Affiliation

Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana.

Publication information

J Big Data. 2021;8(1):140. doi: 10.1186/s40537-021-00516-9. Epub 2021 Oct 27.

Abstract

Machine learning has been a cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values follow one of three mechanisms: missing completely at random, missing at random, or missing not at random. Any of these may result from system malfunction during data collection or human error during data pre-processing. It is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. Several approaches to handling missing values have been proposed in the literature. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations, and the kind of data for which they are most suitable. We propose and evaluate two methods: k-nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris dataset and a novel power plant fan dataset, with missing values induced at missingness rates of 5% to 20%. We show that both missForest and k-nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
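To make the evaluation protocol described in the abstract concrete, the following is a minimal sketch of inducing missing values completely at random at rates of 5% to 20% and filling them with a k-nearest-neighbor rule. It is not the authors' implementation: the synthetic data stands in for the Iris and power plant fan datasets, the function names (`make_toy_data`, `induce_mcar`, `knn_impute`, `masked_rmse`) and the choice of `k=3` are assumptions made for illustration, and missForest is not shown.

```python
import math
import random


def make_toy_data(n, rng):
    """Hypothetical stand-in dataset: four correlated numeric features per row."""
    rows = []
    for _ in range(n):
        base = rng.gauss(0.0, 1.0)
        rows.append([base + rng.gauss(0.0, 0.1),
                     2.0 * base + rng.gauss(0.0, 0.1),
                     -base + rng.gauss(0.0, 0.1),
                     0.5 * base + rng.gauss(0.0, 0.1)])
    return rows


def induce_mcar(rows, rate, rng):
    """Copy rows, setting each cell to None with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def knn_impute(rows, k=3):
    """Fill each None with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over the columns that are
    observed in both rows. Observed cells are never modified.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            donors = [v for _, v in candidates[:k]]
            if not donors:
                # No comparable neighbor: fall back to the observed column mean.
                donors = [r[j] for r in rows if r[j] is not None]
            if donors:  # stays None only if the whole column is missing
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


def masked_rmse(truth, masked, imputed):
    """RMSE between true and imputed values, over the induced-missing cells only."""
    errors = [(t - p) ** 2
              for t_row, m_row, p_row in zip(truth, masked, imputed)
              for t, m, p in zip(t_row, m_row, p_row)
              if m is None and p is not None]
    return math.sqrt(sum(errors) / len(errors)) if errors else float("nan")


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_toy_data(60, rng)
    for rate in (0.05, 0.10, 0.20):
        masked = induce_mcar(data, rate, rng)
        filled = knn_impute(masked, k=3)
        print(f"missing rate {rate:.0%}: RMSE = {masked_rmse(data, masked, filled):.3f}")
```

Because the true values are known before masking, the error on the masked cells gives a direct measure of imputation quality at each missingness rate. The same loop could wrap any other imputer for a side-by-side comparison.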
