Zhao Jiwei, Chen Chi
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA.
Novartis Institutes for Biomedical Research, Shanghai 201203, China.
Entropy (Basel). 2020 Oct 14;22(10):1154. doi: 10.3390/e22101154.
We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model may be a traditional low-dimensional one or a modern high-dimensional one, where a sparsity assumption is usually imposed and regularization is commonly used. Because the missingness mechanism, although usually treated as a nuisance, is difficult to specify correctly, we adopt a conditional likelihood approach in which the nuisance can be completely ignored throughout the procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via a data manipulation strategy. In particular, for the high-dimensional setting where regularization is needed, we propose a data perturbation method for post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism is likely to be missing not at random, as with patient-reported outcomes or real-world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.
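To illustrate why conditioning can make the missingness mechanism drop out, here is a minimal sketch; it is not the authors' implementation. It rests on assumptions that the abstract does not state: a Gaussian linear model with a single covariate, and a missingness probability that depends on the outcome alone. Under those assumptions, for any two observed subjects i and j, conditioning on the unordered pair of outcomes cancels the missingness probabilities. The resulting likelihood is that of a logistic regression without intercept on the products (y_i - y_j)(x_i - x_j), and the parameter it identifies is beta / sigma^2 rather than beta itself. All function names and simulation settings are hypothetical.

```python
import math
import random


def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate(n, beta, sigma, rng):
    """Draw y = 1 + beta * x + noise, then keep each outcome with a
    probability that depends on y itself (missing not at random).
    The selection rule below is an arbitrary illustrative choice."""
    observed = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + beta * x + rng.gauss(0.0, sigma)
        if rng.random() < expit(0.5 - y):  # hypothetical mechanism
            observed.append((x, y))
    return observed


def pairwise_features(data):
    """For every pair i < j of observed subjects, form
    z_ij = (y_i - y_j) * (x_i - x_j)."""
    z = []
    for i in range(len(data)):
        xi, yi = data[i]
        for j in range(i + 1, len(data)):
            xj, yj = data[j]
            z.append((yi - yj) * (xi - xj))
    return z


def fit_gamma(z, max_iter=50, tol=1e-10):
    """Maximize sum(log expit(z * g)) over the scalar g by Newton's method.
    The missingness mechanism never appears in this objective."""
    g = 0.0
    for _ in range(max_iter):
        grad = 0.0
        hess = 0.0
        for zk in z:
            p = expit(zk * g)
            grad += (1.0 - p) * zk          # first derivative of the objective
            hess += p * (1.0 - p) * zk * zk  # minus the second derivative
        step = grad / hess
        g += step
        if abs(step) < tol:
            break
    return g


if __name__ == "__main__":
    rng = random.Random(2020)
    data = simulate(1200, beta=1.0, sigma=1.0, rng=rng)
    gamma_hat = fit_gamma(pairwise_features(data))
    print("observed outcomes:", len(data))
    print("estimate of beta / sigma^2:", gamma_hat)
```

With beta = 1 and sigma = 1 the target value is 1, so the printed estimate should be close to it in a sample of this size, even though the selection rule was never modeled. The sketch covers only the low-dimensional point estimate; the regularized high-dimensional fit and the data perturbation step for post-selection inference mentioned in the abstract are not reproduced here.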