Emmanuel Tlamelo, Maupong Thabiso, Mpoeleng Dimane, Semong Thabo, Mphago Banyatsang, Tabona Oteng
Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana.
J Big Data. 2021;8(1):140. doi: 10.1186/s40537-021-00516-9. Epub 2021 Oct 27.
Machine learning has been the cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values arise under different mechanisms: missing completely at random, missing at random, or missing not at random. Any of these may result from system malfunction during data collection or human error during data pre-processing. It is nevertheless important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. The literature contains several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods: k-nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris data and novel power plant fan data, with induced missing values at missingness rates of 5% to 20%. We show that both missForest and k-nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
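The evaluation described above (induce missing values at a fixed rate, impute, compare against the known values) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes scikit-learn, a missing-completely-at-random mechanism, and RMSE on the masked entries as the error measure. missForest is approximated here by scikit-learn's `IterativeImputer` wrapped around a `RandomForestRegressor`, which is similar in spirit but not the original missForest implementation. The helper names `induce_mcar`, `rmse_on_missing` and `compare_imputers`, the neighbor count, the tree count and the iteration limit are all illustrative assumptions. Only the Iris data is used, since the power plant fan data is not available here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer


def induce_mcar(X, rate, rng):
    """Return a copy of X with about `rate` of its entries set to NaN, and the mask used.

    Assumption: entries are removed completely at random (MCAR).
    """
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def rmse_on_missing(X_true, X_imputed, mask):
    """Root mean squared error computed only over the entries that were removed."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def compare_imputers(rates=(0.05, 0.10, 0.20), seed=0):
    """Impute Iris at each missingness rate with both methods and report RMSE."""
    X = load_iris().data
    rng = np.random.default_rng(seed)
    results = {}
    for rate in rates:
        X_missing, mask = induce_mcar(X, rate, rng)

        # k-nearest neighbor imputation; k=5 is an illustrative choice.
        knn = KNNImputer(n_neighbors=5)

        # missForest-like imputation: iterative, one random forest per feature.
        forest = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=30, random_state=seed),
            max_iter=5,
            random_state=seed,
        )

        results[rate] = {
            "knn": rmse_on_missing(X, knn.fit_transform(X_missing), mask),
            "forest": rmse_on_missing(X, forest.fit_transform(X_missing), mask),
        }
    return results


if __name__ == "__main__":
    for rate, scores in compare_imputers().items():
        print(f"missing rate {rate:.0%}: {scores}")
```

Because the missing entries are induced from complete data, the true values are known and the error can be measured directly. The figures this sketch prints depend on the random seed and the assumed parameters, so they should not be read as the paper's results.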