Schifano Elizabeth D, Wu Jing, Wang Chun, Yan Jun, Chen Ming-Hui
Department of Statistics, University of Connecticut.
Technometrics. 2016;58(3):393-403. doi: 10.1080/00401706.2016.1142900. Epub 2016 Jul 8.
We present statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storing or accessing the historical data. In particular, we develop iterative estimating algorithms and statistical inference for linear models and estimating equations that update as new data arrive. These algorithms are computationally efficient, minimally storage-intensive, and allow for possible rank deficiencies in the subset design matrices due to rare-event covariates. Within the linear model setting, the proposed online-updating framework leads to predictive residual tests that can be used to assess the goodness-of-fit of the hypothesized model. We also propose a new online-updating estimator under the estimating equation setting. Theoretical properties of the goodness-of-fit tests and proposed estimators are examined in detail. In simulation studies and real data applications, our estimator compares favorably with competing approaches under the estimating equation setting.
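The online-updating idea for the linear model can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes ordinary least squares, keeps only the cumulative cross-products X'X and X'y, and discards each data block after use. The class `OnlineLeastSquares` and the helpers `solve` and `demo` are hypothetical names introduced for illustration. The sketch also assumes the cumulative X'X is nonsingular, so it does not reproduce the paper's handling of rank-deficient subsets, its predictive residual tests, or its estimating-equation estimator.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("cumulative X'X is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


class OnlineLeastSquares:
    """Keeps only X'X and X'y, so storage is O(p^2) regardless of stream length."""

    def __init__(self, p):
        self.p = p
        self.xtx = [[0.0] * p for _ in range(p)]
        self.xty = [0.0] * p

    def update(self, rows, ys):
        """Fold one block of observations into the running cross-products."""
        for x, y in zip(rows, ys):
            for i in range(self.p):
                self.xty[i] += x[i] * y
                for j in range(self.p):
                    self.xtx[i][j] += x[i] * x[j]

    def coef(self):
        """Least-squares estimate from all data seen so far."""
        return solve(self.xtx, self.xty)


def demo(seed=0, n_blocks=6, block_size=100):
    """Stream simulated blocks, then compare with a single fit on all data."""
    rng = random.Random(seed)
    beta = [1.0, -2.0, 0.5]  # assumed true coefficients (intercept first)
    streamed = OnlineLeastSquares(3)
    all_rows, all_ys = [], []
    for _ in range(n_blocks):
        rows = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)]
                for _ in range(block_size)]
        ys = [sum(b * v for b, v in zip(beta, x)) + rng.gauss(0, 0.1)
              for x in rows]
        streamed.update(rows, ys)  # the block could be discarded after this
        all_rows += rows
        all_ys += ys
    batch = OnlineLeastSquares(3)
    batch.update(all_rows, all_ys)  # one pass over the stored data
    return streamed.coef(), batch.coef()
```

Calling `demo()` returns the streamed and the all-at-once estimates. They agree because the cross-products are additive across blocks, which is what lets the estimate be updated without revisiting historical data.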