Jin Jun, Liu Shuangzhe, Ma Tiefeng
College of Mathematical Sciences, Yangzhou University, Yangzhou, People's Republic of China.
Faculty of Science and Technology, University of Canberra, Canberra, Australia.
J Appl Stat. 2023 Apr 26;51(8):1427-1445. doi: 10.1080/02664763.2023.2205611. eCollection 2024.
Datasets that are large in volume, variety and velocity are becoming increasingly common, but limits on computing power often restrict the analyses that can be performed on them. Nonuniform subsampling methods are effective in reducing the computational load for massive data. However, the variance of a nonuniform subsampling estimator becomes large when the subsampling probabilities are highly heterogeneous. To address this, we develop two new algorithms that improve estimation for massive data logistic regression: one based on a chosen hard threshold value and one based on combining subsamples. The basic idea of the hard threshold method is to carefully select a threshold value and then replace every subsampling probability lower than the threshold with the threshold value itself. The main idea behind the combining subsamples method is to better exploit the information in the data without hitting the computational bottleneck, by generating many subsamples and then combining the estimates constructed from them. The combining subsamples method obtains the standard error of the parameter estimator without estimating the sandwich matrix, which makes statistical inference for massive data more convenient, and it can significantly improve estimation efficiency. Asymptotic properties of the resulting estimators are established. Simulations and real data analyses are conducted to assess and demonstrate the practical performance of the proposed methods.
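The two ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the subsampling probabilities are taken proportional to the covariate norms, the thresholded probabilities are renormalised to sum to one, the subsample estimates are combined by simple averaging, and the standard error is taken from the empirical spread of the subsample estimates (which reflects subsampling variability only). The function names `threshold_probs`, `weighted_logistic_fit` and `combined_subsample_estimate`, and the values of `delta`, `r` and `k`, are hypothetical.

```python
import numpy as np

def threshold_probs(probs, delta):
    """Hard threshold: raise every probability below delta to delta.
    Renormalising afterwards is an assumption of this sketch."""
    clipped = np.maximum(probs, delta)
    return clipped / clipped.sum()

def weighted_logistic_fit(X, y, w, n_iter=25, ridge=1e-8):
    """Newton iterations for a weighted logistic regression log-likelihood."""
    d = X.shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        z = np.clip(X @ beta, -30.0, 30.0)      # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(d)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def combined_subsample_estimate(X, y, probs, r, k, rng):
    """Draw k subsamples of size r with the given probabilities, fit an
    inverse-probability-weighted logistic regression on each, then average.
    The standard error uses the spread across subsample estimates
    (an assumption of this sketch), so no sandwich matrix is estimated."""
    n = X.shape[0]
    estimates = []
    for _ in range(k):
        idx = rng.choice(n, size=r, replace=True, p=probs)
        w = 1.0 / probs[idx]
        w = w / w.mean()                        # rescale weights for numerical stability
        estimates.append(weighted_logistic_fit(X[idx], y[idx], w))
    estimates = np.array(estimates)
    beta_bar = estimates.mean(axis=0)
    se = estimates.std(axis=0, ddof=1) / np.sqrt(k)
    return beta_bar, se

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n, d = 20000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([0.5, -0.5, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

# Illustrative nonuniform probabilities (assumption): proportional to ||x_i||.
raw = np.linalg.norm(X, axis=1)
probs = raw / raw.sum()

# Hypothetical threshold: half of the uniform probability 1/n.
probs_thr = threshold_probs(probs, delta=0.5 / n)

beta_bar, se = combined_subsample_estimate(X, y, probs_thr, r=500, k=20, rng=rng)
print("combined estimate:", beta_bar)
print("standard error:   ", se)
```

Raising the smallest probabilities bounds the largest inverse-probability weight, which is what keeps the variance of the weighted estimator under control. Each subsample fit touches only `r` rows, so the cost of the combined estimator grows with `k * r` rather than with the full sample size `n`.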