Gong Tieliang, Dong Yuxin, Chen Hong, Dong Bo, Li Chen
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2250-2262. doi: 10.1109/TNNLS.2022.3189069. Epub 2024 Feb 5.
Subsampling is an important technique for tackling the computational challenges posed by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to samples that appear to have a large impact. When the noise level is high, these procedures tend to pick many outliers and therefore often perform unsatisfactorily in practice. To address this issue, we design a new Markov subsampling strategy based on the Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as refined working data for efficient processing. HMS is built upon a Metropolis-Hastings procedure, in which the inclusion probability of each sampling unit is determined using the Huber criterion to prevent overscoring the outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large-scale simulations and real data examples.
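The abstract does not spell out the acceptance rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a linear model, a least-squares pilot estimate fitted on a small random subset, a uniform (symmetric) proposal over indices, and a target proportional to exp(-Huber loss). The names `huber_loss`, `huber_markov_subsample`, `n_pilot`, `max_steps` and the threshold `delta=1.345` are all illustrative choices. Only accepted candidates are recorded, which departs from a textbook Metropolis-Hastings chain that would repeat the current state on rejection, and the returned indices may contain repeats.

```python
import numpy as np


def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))


def huber_markov_subsample(X, y, n_sub, delta=1.345, n_pilot=200,
                           max_steps=None, seed=0):
    """Return up to n_sub indices chosen by a Metropolis-Hastings-style walk.

    Assumption (not from the paper): the target over indices is
    proportional to exp(-huber_loss(residual)), with residuals taken
    from a least-squares pilot fit.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if max_steps is None:
        max_steps = 100 * n_sub  # guard against a very slow chain

    # Pilot estimate from a small uniform subset (illustrative choice).
    pilot_idx = rng.choice(n, size=min(n_pilot, n), replace=False)
    beta0, *_ = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)
    losses = huber_loss(y - X @ beta0, delta)

    current = int(rng.integers(n))
    chosen = [current]
    for _ in range(max_steps):
        if len(chosen) >= n_sub:
            break
        cand = int(rng.integers(n))  # symmetric uniform proposal
        diff = losses[current] - losses[cand]
        # Acceptance ratio min(1, exp(diff)); candidates with a much
        # larger Huber loss (likely outliers) are rarely accepted.
        accept_prob = 1.0 if diff >= 0 else float(np.exp(diff))
        if rng.random() < accept_prob:
            current = cand
            chosen.append(current)
    return np.array(chosen)
```

Because the Huber loss grows only linearly for large residuals, the score stays finite and well behaved for gross outliers, while the accept/reject step still steers the walk toward points that agree with the pilot fit.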