Wang Chun, Chen Ming-Hui, Wu Jing, Yan Jun, Zhang Yuping, Schifano Elizabeth
Liberty Mutual Insurance, Boston, MA, USA.
Department of Statistics, University of Connecticut, Storrs, CT, USA.
Can J Stat. 2018 Mar;46(1):123-146. doi: 10.1002/cjs.11330. Epub 2017 Aug 9.
For big data arriving in streams, online updating is an important statistical method that breaks the storage barrier and the computational barrier under certain circumstances. In the regression context, online updating algorithms assume that the set of predictor variables does not change, and consequently cannot incorporate new variables that may become available midway through the data stream. A naive approach would be to discard all previous information and start updating with new variables from scratch. We propose a method that utilizes the information from earlier data in the online updating algorithm with bias corrections to improve efficiency. The method is developed for linear models first, and then extended to estimating equations for generalized linear models. Closed-form expressions for the efficiency gain over the naive approach are derived in a particular linear model setting. We compare the performance of our proposed bias-correcting approach and the naive approach in simulation studies with data generated from a normal linear model and a logistic regression model. The method is applied to a study on airline delay, where reasons for delays were only available more recently, starting in 2003.
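As a rough illustration of the setting described above, the following minimal sketch shows standard online updating for a linear model and the naive restart when a new predictor appears. It is not the bias-correcting method proposed in the paper. The class `OnlineOLS`, the helpers `solve` and `make_block`, and the simulated coefficients are all hypothetical names and values chosen for this example. Only the running sums X'X and X'y are stored, so no raw block needs to be kept once it has been processed.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

class OnlineOLS:
    """Least squares that keeps only the running sums X'X and X'y."""
    def __init__(self, p):
        self.xtx = [[0.0] * p for _ in range(p)]
        self.xty = [0.0] * p

    def update(self, X, y):
        # Fold one block of rows into the sufficient statistics.
        for row, yi in zip(X, y):
            for i, xi in enumerate(row):
                self.xty[i] += xi * yi
                for j, xj in enumerate(row):
                    self.xtx[i][j] += xi * xj

    def coef(self):
        return solve(self.xtx, self.xty)

def make_block(n, with_x2):
    """Simulate y = 1 + 2*x1 + 3*x2 + noise; x2 may be unobserved (assumed setup)."""
    X, y = [], []
    for _ in range(n):
        x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
        y.append(1.0 + 2.0 * x1 + 3.0 * x2 + random.gauss(0, 0.1))
        X.append([1.0, x1, x2] if with_x2 else [1.0, x1])
    return X, y

random.seed(0)
early_blocks = [make_block(200, False) for _ in range(2)]  # x2 not yet available
late_blocks = [make_block(200, True) for _ in range(3)]    # x2 now available

# Model fitted on the early blocks, using the original predictor set only.
early = OnlineOLS(2)
for X, y in early_blocks:
    early.update(X, y)

# Naive approach: discard `early` and restart with the enlarged predictor set.
naive = OnlineOLS(3)
for X, y in late_blocks:
    naive.update(X, y)
naive_coef = naive.coef()

# Sanity check: block-by-block updating matches a single fit on the pooled late data.
pooled = OnlineOLS(3)
pooled.update([r for X, _ in late_blocks for r in X],
              [v for _, y in late_blocks for v in y])
pooled_coef = pooled.coef()
```

In this sketch the information accumulated in `early` is simply thrown away by the naive restart; the paper's contribution is a way to reuse that earlier information with bias corrections, which is not reproduced here.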