Handling high-dimensional data with missing values by modern machine learning techniques.

Authors

Chen Sixia, Xu Chao

Affiliation

Department of Biostatistics and Epidemiology, The University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA.

Publication information

J Appl Stat. 2022 May 1;50(3):786-804. doi: 10.1080/02664763.2022.2068514. eCollection 2023.

DOI:10.1080/02664763.2022.2068514

PMID:36819079

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9930810/

Abstract

High-dimensional data are regarded as one of the most important types of big data in practice. They arise frequently in applications including genetic, financial, and geographical studies. Missing data in high-dimensional data analysis should be handled properly to reduce nonresponse bias. We discuss modern machine learning techniques for handling missing data with high dimensionality, including penalized regression approaches, tree-based approaches, and deep learning (DL). Specifically, our proposed methods can be used to estimate general parameters of interest, including population means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators. We compare these methods through limited simulation studies and a real application. Both the simulation studies and the real application show the benefits of the DL and XGBoost approaches over the other methods in terms of balancing bias and variance.
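To make the three estimator families named in the abstract concrete, here is a minimal sketch for the population mean. It assumes the standard textbook forms of the imputation, propensity score (Hajek-type inverse probability weighting), and doubly robust estimators; the abstract does not give formulas, so the paper's exact definitions may differ. The function names, the inputs `m_hat` (predicted outcomes) and `pi_hat` (predicted response probabilities), and the toy numbers are all hypothetical. In practice, `m_hat` and `pi_hat` would come from a fitted learner such as a penalized regression, XGBoost, or a deep network.

```python
from typing import Optional, Sequence


def imputation_mean(y: Sequence[Optional[float]],
                    delta: Sequence[int],
                    m_hat: Sequence[float]) -> float:
    """Imputation estimator: use y when observed, the model prediction otherwise."""
    n = len(delta)
    return sum(y[i] if delta[i] else m_hat[i] for i in range(n)) / n


def propensity_mean(y: Sequence[Optional[float]],
                    delta: Sequence[int],
                    pi_hat: Sequence[float]) -> float:
    """Hajek-type propensity score estimator: weight respondents by 1 / pi_hat."""
    num = sum(y[i] / pi_hat[i] for i in range(len(delta)) if delta[i])
    den = sum(1.0 / pi_hat[i] for i in range(len(delta)) if delta[i])
    return num / den


def doubly_robust_mean(y: Sequence[Optional[float]],
                       delta: Sequence[int],
                       m_hat: Sequence[float],
                       pi_hat: Sequence[float]) -> float:
    """Doubly robust estimator: prediction plus weighted residual for respondents."""
    n = len(delta)
    total = 0.0
    for i in range(n):
        total += m_hat[i]
        if delta[i]:
            total += (y[i] - m_hat[i]) / pi_hat[i]
    return total / n


# Toy data (hypothetical): delta[i] = 1 means y[i] was observed.
y = [2.0, None, 4.0, None]
delta = [1, 0, 1, 0]
m_hat = [2.5, 3.0, 3.5, 5.0]   # assumed outcome-model predictions
pi_hat = [0.5, 0.5, 0.8, 0.4]  # assumed response-propensity predictions

print(imputation_mean(y, delta, m_hat))
print(propensity_mean(y, delta, pi_hat))
print(doubly_robust_mean(y, delta, m_hat, pi_hat))
```

The doubly robust form stays consistent if either the outcome model or the propensity model is correctly specified, which is the usual motivation for combining the two; how well each learner balances bias and variance in high dimensions is the empirical question the abstract describes.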

