Department of Biostatistics, University of Michigan, Washington Heights, Ann Arbor, MI, USA.
Department of Epidemiology, University of Michigan, Washington Heights, Ann Arbor, MI, USA.
Biostatistics. 2018 Oct 1;19(4):461-478. doi: 10.1093/biostatistics/kxx041.
Distributed lag models (DLMs) have been widely used in environmental epidemiology to quantify the lagged effects of air pollution on an outcome of interest such as mortality or cardiovascular events. More generally, DLMs apply to time-series data in which the current value of an independent variable and its lagged values jointly affect the current value of a dependent variable. The corresponding distributed lag (DL) function describes how the coefficients of the lagged exposure variables vary with the lag. Common choices for the DL function include polynomials and splines. On the one hand, such a constrained DLM specifies the coefficients as a function of the lags and reduces the number of parameters to be estimated, so higher efficiency can be achieved. On the other hand, if the assumed form of the DL function is violated, effect estimates can be severely biased. In this article, we propose a general framework for shrinking the coefficient estimates from an unconstrained DLM, which are unbiased but potentially inefficient, toward the coefficient estimates from a constrained DLM to achieve a bias-variance trade-off. The amount of shrinkage can be determined in various ways, and we explore several such methods: empirical Bayes-type shrinkage, a hierarchical Bayes approach, and generalized ridge regression. We also consider a two-stage shrinkage approach that forces the effect estimates to approach zero as the lag increases. We compare the methods in an extensive simulation study and show that the shrinkage methods have better average performance across different scenarios in terms of mean squared error (MSE). We illustrate the methods using data from the National Morbidity, Mortality, and Air Pollution Study (NMMAPS) to explore the associations of PM$_{10}$, O$_3$, and SO$_2$ with three types of disease event counts in Chicago, IL, from 1987 to 2000.
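The shrinkage idea can be illustrated with a minimal Python sketch. It is not the estimator from the article: it assumes a Gaussian outcome fitted by least squares (count outcomes such as those in NMMAPS would normally call for a Poisson-type model), a cubic polynomial DL function, and a simple ridge-type penalty on the part of the coefficient vector that lies outside the polynomial space. The function names (`lag_matrix`, `polynomial_basis`, `fit_dlms`), the penalty weight `lam`, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def lag_matrix(x, max_lag):
    """Design matrix whose row for time t holds x[t], x[t-1], ..., x[t-max_lag]."""
    n = len(x)
    return np.column_stack([x[max_lag - l : n - l] for l in range(max_lag + 1)])

def polynomial_basis(max_lag, degree):
    """Basis C such that a constrained DLM assumes beta = C @ theta."""
    lags = np.arange(max_lag + 1) / max_lag  # rescale lags to [0, 1]
    return np.vander(lags, degree + 1, increasing=True)

def fit_dlms(x, y, max_lag, degree, lam):
    """Return unconstrained, constrained, and shrunken lag coefficients.

    Assumption: Gaussian outcome; the intercept is handled by centering.
    """
    X = lag_matrix(x, max_lag)
    X = X - X.mean(axis=0)
    yc = y[max_lag:] - y[max_lag:].mean()
    C = polynomial_basis(max_lag, degree)

    # Unconstrained DLM: one free coefficient per lag.
    beta_u = np.linalg.lstsq(X, yc, rcond=None)[0]

    # Constrained DLM: coefficients forced onto the polynomial in the lag.
    theta = np.linalg.lstsq(X @ C, yc, rcond=None)[0]
    beta_c = C @ theta

    # Ridge-type shrinkage: penalize only the component of beta that is
    # orthogonal to the polynomial space. lam = 0 reproduces beta_u;
    # lam -> infinity approaches beta_c.
    R = np.eye(max_lag + 1) - C @ np.linalg.pinv(C)
    beta_s = np.linalg.solve(X.T @ X + lam * R, X.T @ yc)
    return beta_u, beta_c, beta_s

# Simulated example (hypothetical data): exponentially decaying lag effects.
rng = np.random.default_rng(0)
n, L = 2000, 14
x = rng.normal(size=n)
true_beta = 0.5 * np.exp(-0.4 * np.arange(L + 1))
y = np.zeros(n)
y[L:] = lag_matrix(x, L) @ true_beta + rng.normal(size=n - L)

beta_u, beta_c, beta_s = fit_dlms(x, y, max_lag=L, degree=3, lam=500.0)
```

In this sketch `lam` is fixed by hand; in practice the amount of shrinkage would be chosen from the data, which is where the empirical Bayes, hierarchical Bayes, and generalized ridge approaches named in the abstract differ.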