Mao Xiaojun, Wang Hengfang, Wang Zhonglei, Yang Shu
School of Mathematical Sciences, Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai, 200240, China.
School of Mathematics and Statistics & Fujian Provincial Key Laboratory of Statistics and Artificial Intelligence, Fujian Normal University, Fujian 350007, China.
J Comput Graph Stat. 2024;33(4):1320-1328. doi: 10.1080/10618600.2024.2319154. Epub 2024 Mar 29.
Modern surveys with large sample sizes and growing mixed-type questionnaires require robust and scalable analysis methods. In this work, we consider recovering a mixed dataframe matrix, obtained by complex survey sampling, with entries following different canonical exponential distributions and subject to heterogeneous missingness. To tackle this challenging task, we propose a two-stage procedure: in the first stage, we model the entry-wise missing mechanism by logistic regression, and in the second stage, we complete the target parameter matrix by maximizing a weighted log-likelihood with a low-rank constraint. We propose a fast and scalable estimation algorithm that achieves sublinear convergence, and we rigorously derive an upper bound for the estimation error of the proposed method. Experimental results support our theoretical claims, and the proposed estimator compares favorably with existing methods. The proposed method is applied to analyze the National Health and Nutrition Examination Survey data. Supplementary materials for this article are available online.
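The two-stage procedure described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes one row-level covariate drives missingness, fits a separate logistic regression per column, handles only Gaussian and Bernoulli columns, ignores survey design weights unless they are passed in, and swaps the low-rank constraint for a nuclear-norm penalty solved by proximal gradient descent. All function names, parameters (`lam`, `design_w`, `clip`), and the simulated data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_observation_prob(X, r, n_iter=25, ridge=1e-6):
    """Stage 1: Newton-Raphson logistic regression of the 0/1 indicator r on X."""
    Z = np.column_stack([np.ones(len(r)), X])  # add an intercept column
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Z @ beta)
        grad = Z.T @ (r - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z + ridge * np.eye(Z.shape[1])
        beta += np.linalg.solve(hess, grad)
    return sigmoid(Z @ beta)  # fitted observation probabilities

def svd_soft_threshold(A, tau):
    """Proximal operator of tau * nuclear norm: shrink the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_mixed(Y, R, pi_hat, is_binary, lam, design_w=None,
                   n_iter=200, clip=0.05):
    """Stage 2: inverse-probability-weighted likelihood with a nuclear-norm penalty."""
    n, _ = Y.shape
    d = np.ones(n) if design_w is None else np.asarray(design_w, dtype=float)
    W = d[:, None] * R / np.clip(pi_hat, clip, 1.0)  # zero weight where missing
    step = 1.0 / W.max()              # curvature of each entry's loss is <= its weight
    Y0 = np.where(R == 1, Y, 0.0)     # placeholder value at missing entries
    theta = np.zeros((n, Y.shape[1]))
    for _ in range(n_iter):
        # Mean of a canonical exponential family entry: identity link for
        # Gaussian columns, logistic link for Bernoulli columns.
        mean = np.where(is_binary[None, :], sigmoid(theta), theta)
        grad = W * (mean - Y0)        # gradient of the weighted negative log-likelihood
        theta = svd_soft_threshold(theta - step * grad, step * lam)
    return theta

# Simulated example (all values hypothetical).
rng = np.random.default_rng(0)
n, m, rank = 300, 30, 2
U = rng.normal(size=(n, rank))
V = rng.normal(size=(m, rank))
theta_true = U @ V.T
is_binary = np.arange(m) >= m // 2    # first half Gaussian, second half Bernoulli
Y = np.where(is_binary,
             rng.binomial(1, sigmoid(theta_true)),
             theta_true + rng.normal(size=(n, m)))
x = U[:, [0]]                         # row covariate that drives missingness
pi_true = np.repeat(sigmoid(1.0 + 0.5 * x), m, axis=1)
R = rng.binomial(1, pi_true)
Y[R == 0] = np.nan

pi_hat = np.column_stack([fit_observation_prob(x, R[:, j]) for j in range(m)])
theta_hat = complete_mixed(Y, R, pi_hat, is_binary, lam=30.0)
corr = np.corrcoef(theta_hat.ravel(), theta_true.ravel())[0, 1]
print("correlation between estimated and true parameter matrix:", corr)
```

The penalty level `lam` and the probability floor `clip` are tuning choices in this sketch; in practice they would be selected by cross-validation or similar, and the nuclear-norm penalty shrinks the estimate toward zero, so the recovered matrix is biased in scale.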