Robust and efficient subsampling algorithms for massive data logistic regression.

Authors

Jin Jun, Liu Shuangzhe, Ma Tiefeng

Affiliations

College of Mathematical Sciences, Yangzhou University, Yangzhou, People's Republic of China.

Faculty of Science and Technology, University of Canberra, Canberra, Australia.

Publication information

J Appl Stat. 2023 Apr 26;51(8):1427-1445. doi: 10.1080/02664763.2023.2205611. eCollection 2024.

Abstract

Datasets that are big in volume, variety and velocity are becoming increasingly common, but limitations in computer processing often restrict the analyses that can be performed on them. Nonuniform subsampling methods are effective in reducing the computational load for massive data. However, the variance of the resulting estimator becomes large when the subsampling probabilities are highly heterogeneous. To address this, we develop two new algorithms that improve estimation for massive data logistic regression: one based on a chosen hard threshold value and one based on combining subsamples. The basic idea of the hard threshold method is to carefully select a threshold value and then replace every subsampling probability lower than the threshold with the threshold value itself. The main idea behind the combining subsamples method is to better exploit the information in the data without hitting the computational bottleneck, by generating many subsamples and then combining the estimates constructed from them. The combining subsamples method obtains the standard error of the parameter estimator without estimating the sandwich matrix, which is convenient for statistical inference with massive data, and it can significantly improve estimation efficiency. Asymptotic properties of the resulting estimators are established. Simulations and an analysis of real data are conducted to assess and showcase the practical performance of the proposed methods.
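The two ideas in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the simulated data, the pilot-based choice of subsampling probabilities (proportional to |y - p| times the covariate norm, as is common in the optimal subsampling literature), the renormalization after thresholding, the quantile used as threshold, and the use of the spread across subsample estimates as a standard error are all assumptions made for illustration.

```python
import numpy as np

def hard_threshold(pi, delta):
    """Raise every probability below delta to delta.
    Renormalizing to sum to 1 afterwards is an assumption of this sketch."""
    clipped = np.maximum(pi, delta)
    return clipped / clipped.sum()

def weighted_logistic_fit(X, y, w, n_iter=50, tol=1e-10):
    """Weighted logistic regression fitted by Newton's method."""
    w = w / w.sum()  # rescaling weights does not change the maximizer
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def subsample_estimate(X, y, pi, r, rng):
    """Draw r points with replacement using probabilities pi,
    then fit with inverse-probability weights."""
    idx = rng.choice(len(y), size=r, replace=True, p=pi)
    return weighted_logistic_fit(X[idx], y[idx], 1.0 / pi[idx])

def combined_estimate(X, y, pi, r, K, rng):
    """Average K independent subsample estimates. The standard error
    comes from their empirical spread (an assumption of this sketch),
    so no sandwich matrix is estimated."""
    betas = np.array([subsample_estimate(X, y, pi, r, rng) for _ in range(K)])
    return betas.mean(axis=0), betas.std(axis=0, ddof=1) / np.sqrt(K)

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Pilot fit on a small uniform subsample, used to build nonuniform probabilities.
pilot = rng.choice(n, size=500, replace=False)
beta_pilot = weighted_logistic_fit(X[pilot], y[pilot], np.ones(500))
p_pilot = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
raw = np.abs(y - p_pilot) * np.linalg.norm(X, axis=1)
pi = raw / raw.sum()

# Hard threshold at the 10% quantile of the probabilities (arbitrary choice).
pi_ht = hard_threshold(pi, delta=np.quantile(pi, 0.1))

beta_bar, se = combined_estimate(X, y, pi_ht, r=500, K=20, rng=rng)
print("combined estimate:", beta_bar)
print("standard error:   ", se)
```

Thresholding bounds the inverse-probability weights from above, which is what keeps a few very small probabilities from dominating the variance. Each subsample fit only touches `r` rows, so the cost of the combined estimator grows with `K * r` rather than with `n`.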

Similar articles

1
Robust and efficient subsampling algorithms for massive data logistic regression.
J Appl Stat. 2023 Apr 26;51(8):1427-1445. doi: 10.1080/02664763.2023.2205611. eCollection 2024.
2
Optimal Subsampling for Large Sample Logistic Regression.
J Am Stat Assoc. 2018;113(522):829-844. doi: 10.1080/01621459.2017.1292914. Epub 2018 Jun 6.
3
Optimal subsampling for parametric accelerated failure time models with massive survival data.
Stat Med. 2022 Nov 30;41(27):5421-5431. doi: 10.1002/sim.9576. Epub 2022 Sep 20.
4
Asymptotics of Subsampling for Generalized Linear Regression Models under Unbounded Design.
Entropy (Basel). 2022 Dec 31;25(1):84. doi: 10.3390/e25010084.
5
Sampling-based estimation for massive survival data with additive hazards model.
Stat Med. 2021 Jan 30;40(2):441-450. doi: 10.1002/sim.8783. Epub 2020 Nov 3.
6
Collaborative double robust targeted maximum likelihood estimation.
Int J Biostat. 2010 May 17;6(1):Article 17. doi: 10.2202/1557-4679.1181.
7
Markov Subsampling Based on Huber Criterion.
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2250-2262. doi: 10.1109/TNNLS.2022.3189069. Epub 2024 Feb 5.
9
LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS.
Ann Stat. 2014 Oct 1;42(5):1693-1724. doi: 10.1214/14-AOS1220.
10
Double Robust Efficient Estimators of Longitudinal Treatment Effects: Comparative Performance in Simulations and a Case Study.
Int J Biostat. 2019 Feb 26;15(2):/j/ijb.2019.15.issue-2/ijb-2017-0054/ijb-2017-0054.xml. doi: 10.1515/ijb-2017-0054.

References cited in this article

1
Optimal Subsampling for Large Sample Logistic Regression.
J Am Stat Assoc. 2018;113(522):829-844. doi: 10.1080/01621459.2017.1292914. Epub 2018 Jun 6.
2
LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS.
Ann Stat. 2014 Oct 1;42(5):1693-1724. doi: 10.1214/14-AOS1220.
3
Tuning parameter selectors for the smoothly clipped absolute deviation method.
Biometrika. 2007 Aug 1;94(3):553-568. doi: 10.1093/biomet/asm053.
