Wang Jing, Wang HaiYing, Zhang Hao Helen
Department of Statistics, University of Connecticut, Storrs, CT 06269.
Department of Mathematics, University of Arizona.
Adv Neural Inf Process Syst. 2024;37:98384-98418.
Subsampling is effective in tackling computational challenges for massive data with rare events. Overly aggressive subsampling may adversely affect estimation efficiency, and optimal subsampling is essential to mitigate the information loss. However, existing optimal subsampling probabilities depend on the scale of the data, and some scaling transformations may result in inefficient subsamples. This problem is more significant when there are inactive features, because their influence on the subsampling probabilities can be arbitrarily magnified by inappropriate scaling transformations. We tackle this challenge and introduce a scale-invariant optimal subsampling function in the context of sparse models, where inactive features are commonly assumed. Instead of focusing on estimating model parameters, we define an optimal subsampling function that minimizes the prediction error, using adaptive lasso as an example to outline the estimation procedure and study its theoretical guarantees. We first introduce the adaptive lasso estimator for rare-events data and establish its oracle properties, thereby validating the use of subsampling. Then we derive a scale-invariant optimal subsampling function that minimizes the prediction error of the inverse probability weighted (IPW) adaptive lasso. Finally, we present an estimator based on the maximum sampled conditional likelihood (MSCL) to further improve the estimation efficiency. We conduct numerical experiments using both simulated and real-world data sets to demonstrate the performance of the proposed methods.
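The scale dependence described in the abstract can be illustrated with a minimal sketch. It assumes a norm-based subsampling score proportional to |y - p| times the Euclidean norm of x, which is one common form in the optimal subsampling literature and is not necessarily the form analyzed in this paper. The function name `norm_based_probs`, the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def norm_based_probs(X, resid):
    """Subsampling probabilities proportional to |residual| * ||x||.

    Assumed illustrative score, not the paper's proposed function.
    """
    scores = np.abs(resid) * np.linalg.norm(X, axis=1)
    return scores / scores.sum()

rng = np.random.default_rng(0)
n = 1000

# Column 0 is an active feature; column 1 is inactive (zero coefficient).
X = rng.normal(size=(n, 2))
beta = np.array([1.0, 0.0])
intercept = -4.0  # a strongly negative intercept makes events rare

# Logistic model; the inactive feature does not affect p.
p = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
y = rng.binomial(1, p)
resid = y - p

probs_original = norm_based_probs(X, resid)

# Rescale only the inactive feature. The model and the responses are
# unchanged, but the norm-based probabilities are not.
X_rescaled = X.copy()
X_rescaled[:, 1] *= 100.0
probs_rescaled = norm_based_probs(X_rescaled, resid)

max_abs_change = np.max(np.abs(probs_original - probs_rescaled))
print("largest change in a subsampling probability:", max_abs_change)
```

Under this assumed score, multiplying the inactive column by 100 lets that column dominate the norm, so the sampling probabilities are driven mostly by a feature that carries no information about the response. A scale-invariant subsampling function, as proposed in the abstract, is meant to avoid this effect.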