Chen Sixia, Xu Chao
Department of Biostatistics and Epidemiology, The University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA.
J Appl Stat. 2022 May 1;50(3):786-804. doi: 10.1080/02664763.2022.2068514. eCollection 2023.
High-dimensional data are regarded as one of the most important types of big data in practice. They arise frequently in fields such as genetics, finance, and geography. Missing data in high-dimensional analyses should be handled properly to reduce nonresponse bias. We discuss several modern machine learning techniques for handling missing data with high dimensionality, including penalized regression approaches, tree-based approaches, and deep learning (DL). Specifically, the proposed methods can be used to estimate general parameters of interest, including population means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators. We compare the methods through limited simulation studies and a real application. Both the simulation studies and the real application show the benefits of the DL and XGBoost approaches over the other methods in terms of balancing bias and variance.
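The three estimator families named in the abstract can be illustrated for a population mean with a short sketch. This is not the authors' implementation: the abstract gives no formulas, so the code uses the common textbook forms, and the function name `mean_estimators` and its arguments are hypothetical. It assumes that an outcome model and a response-propensity model (for example a penalized regression, XGBoost, or a neural network) have already been fitted and have produced predictions `m_hat` and `pi_hat` for every unit.

```python
import numpy as np

def mean_estimators(y, r, m_hat, pi_hat):
    """Textbook estimators of a population mean under nonresponse.

    y      : outcomes; entries with r == 0 are ignored (may be NaN)
    r      : response indicator, 1 if y is observed, 0 if missing
    m_hat  : predicted outcome for every unit (assumed outcome model)
    pi_hat : estimated response probability for every unit (assumed
             propensity model), strictly positive
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)

    # Replace missing outcomes by 0 so NaN never enters the sums.
    y_obs = np.where(r == 1, y, 0.0)
    n = r.size

    # Imputation: keep observed values, fill missing ones with predictions.
    imputation = np.mean(r * y_obs + (1 - r) * m_hat)

    # Propensity score (inverse probability weighting) estimator.
    propensity = np.sum(r * y_obs / pi_hat) / n

    # Doubly robust: prediction plus a weighted residual correction.
    doubly_robust = np.mean(m_hat + r * (y_obs - m_hat) / pi_hat)

    return imputation, propensity, doubly_robust

# Toy inputs, purely illustrative.
y = [1.0, 2.0, np.nan, 4.0]
r = [1, 1, 0, 1]
m_hat = [1.0, 2.0, 3.0, 4.0]
pi_hat = [0.5, 0.5, 0.5, 0.5]
print(mean_estimators(y, r, m_hat, pi_hat))
```

In this form, the doubly robust estimator remains consistent if either the outcome model or the propensity model is correctly specified, which is why it is often paired with flexible machine learning fits.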