A survey on missing data in machine learning.

Authors

Emmanuel Tlamelo, Maupong Thabiso, Mpoeleng Dimane, Semong Thabo, Mphago Banyatsang, Tabona Oteng

Affiliation

Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana.

Publication information

J Big Data. 2021;8(1):140. doi: 10.1186/s40537-021-00516-9. Epub 2021 Oct 27.

DOI:10.1186/s40537-021-00516-9

PMID:34722113

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8549433/

Abstract

Machine learning has been a cornerstone of analysing and extracting information from data, and missing values are a problem it often encounters. Missingness is usually classified by mechanism: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Missing values may result from system malfunction during data collection or from human error during data pre-processing. It is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. Several approaches to handling missing values have been proposed in the literature. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods: k-nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris data set and a novel power plant fan data set, with missing values induced at rates of 5% to 20%. We show that both missForest and k-nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
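The evaluation protocol described in the abstract (induce missing values at a known rate, impute, then compare against the held-back truth) can be illustrated with a minimal sketch. This is not the authors' code. It assumes synthetic correlated data in place of the Iris and power plant fan data, an MCAR mechanism for the induced missingness, a plain k-nearest-neighbor imputer with k = 5, RMSE as the error measure, and a column-mean baseline for comparison. All function names (`make_data`, `induce_mcar`, `knn_impute`, `mean_impute`, `run_experiment`) are hypothetical, and missForest is not implemented here.

```python
import math
import random

RATES = [0.05, 0.10, 0.15, 0.20]  # missingness rates named in the abstract


def make_data(n, rng):
    """Synthetic stand-in data (an assumption): four correlated numeric features."""
    rows = []
    for _ in range(n):
        t = rng.uniform(0, 10)
        rows.append([t + rng.gauss(0, 0.3), 2 * t + rng.gauss(0, 0.3),
                     -t + rng.gauss(0, 0.3), 0.5 * t + rng.gauss(0, 0.3)])
    return rows


def induce_mcar(rows, rate, rng):
    """Copy rows and set a uniformly random `rate` fraction of cells to None."""
    masked = [list(r) for r in rows]
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        masked[i][j] = None
    return masked


def column_means(rows):
    """Mean of the observed values in each column (0.0 if none are observed)."""
    means = []
    for j in range(len(rows[0])):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs) if obs else 0.0)
    return means


def distance(a, b):
    """Euclidean distance over features observed in both rows, rescaled
    to the full feature count; infinite if no feature is shared."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) * len(a) / len(shared))


def knn_impute(rows, k=5):
    """Fill each missing cell with the mean of that feature over the k
    nearest rows that observe it; fall back to the column mean."""
    means = column_means(rows)
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted((distance(row, other), other[j])
                            for m, other in enumerate(rows)
                            if m != i and other[j] is not None)
            donors = [v for d, v in donors if d < math.inf][:k]
            filled[i][j] = sum(donors) / len(donors) if donors else means[j]
    return filled


def mean_impute(rows):
    """Baseline: fill each missing cell with its column mean."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def rmse(truth, masked, filled):
    """Root mean squared error over the cells that were masked out."""
    errs = [(filled[i][j] - truth[i][j]) ** 2
            for i, r in enumerate(masked) for j, v in enumerate(r) if v is None]
    return math.sqrt(sum(errs) / len(errs))


def run_experiment(seed=0, n=150):
    """Return {rate: (knn_rmse, mean_rmse)} for each missingness rate."""
    rng = random.Random(seed)
    truth = make_data(n, rng)
    results = {}
    for rate in RATES:
        masked = induce_mcar(truth, rate, rng)
        results[rate] = (rmse(truth, masked, knn_impute(masked)),
                         rmse(truth, masked, mean_impute(masked)))
    return results


if __name__ == "__main__":
    for rate, (knn_err, mean_err) in run_experiment().items():
        print(f"rate={rate:.2f}  kNN RMSE={knn_err:.3f}  mean RMSE={mean_err:.3f}")
```

Because the truth is known for every masked cell, the error of each imputer can be measured directly, which is the point of inducing missingness rather than using naturally incomplete data. A missForest-style imputer could be evaluated under the same harness by swapping `knn_impute` for an iterative random-forest imputer from a library; that substitution is left out here to keep the sketch dependency-free.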

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8d4/8549433/97f379343241/40537_2021_516_Fig1_HTML.jpg
