Chen C, Chock D P, Winkler S L
Ford Research Laboratory, Dearborn, MI 48121 USA.
Environ Health Perspect. 1999 Mar;107(3):217-22. doi: 10.1289/ehp.99107217.
Confounding between the model covariates and causal variables (which may or may not be included as model covariates) is a well-known problem in regression models used in air pollution epidemiology. This problem is usually acknowledged but rarely investigated, especially in the context of generalized linear models. Using synthetic data sets, the present study shows how model overfit, underfit, and misfit in the presence of correlated causal variables in a Poisson regression model affect the estimated coefficients of the covariates and their confidence levels. The study also shows how this effect changes with the ranges of the covariates and the sample size. There is qualitative agreement between these study results and the corresponding expressions in the large-sample limit for ordinary linear models. Confounding of covariates in an overfitted model (with covariates encompassing more than just the causal variables) does not bias the estimated coefficients but reduces their significance. Model underfit (with some causal variables excluded as covariates) or misfit (with covariates encompassing only noncausal variables), on the other hand, leads not only to erroneous estimated coefficients but also to misplaced confidence, reflected in large t-values, that the estimated coefficients are significant. The results of this study indicate that models that use only one or two air quality variables, such as particulate matter ≤ 10 µm and sulfur dioxide, are probably unreliable, and that models containing several correlated and toxic or potentially toxic air quality variables should also be investigated to minimize the risk of model underfit or misfit.
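The overfit, underfit, and misfit cases described in the abstract can be illustrated with a small synthetic experiment. The following is a minimal sketch and not the authors' actual simulation design: the three equicorrelated Gaussian covariates, the correlation of 0.8, the coefficients of 0.10, the sample size, and the helper name fit_poisson are all assumptions chosen for illustration. Here x1 and x2 are causal, and x3 is noncausal but correlated with both.

```python
import numpy as np


def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Log-link Poisson regression fitted by Newton's method.

    Returns the slope estimates and their standard errors
    (the intercept is fitted but not returned).
    """
    X = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        info = X.T @ (X * mu[:, None])  # Fisher information matrix
        step = np.linalg.solve(info, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ beta)
    cov = np.linalg.inv(X.T @ (X * mu[:, None]))
    return beta[1:], np.sqrt(np.diag(cov))[1:]


# Assumed synthetic setup (illustrative values, not taken from the paper).
rng = np.random.default_rng(0)
n, rho = 5000, 0.8
corr = np.array([[1.0, rho, rho],
                 [rho, 1.0, rho],
                 [rho, rho, 1.0]])
x = rng.multivariate_normal(np.zeros(3), corr, size=n)
true_beta = np.array([0.10, 0.10, 0.0])  # x3 has no causal effect
y = rng.poisson(np.exp(1.0 + x @ true_beta))

# Column indices of the covariates included in each fitted model.
models = {
    "correct": [0, 1],      # exactly the causal variables
    "overfit": [0, 1, 2],   # causal variables plus the noncausal x3
    "underfit": [0],        # causal x2 omitted
    "misfit": [2],          # only the noncausal x3
}

results = {}
for name, cols in models.items():
    coef, se = fit_poisson(x[:, cols], y)
    results[name] = (coef, se)
    print(f"{name:9s} coef={np.round(coef, 3)} t={np.round(coef / se, 1)}")
```

Under these assumptions, the overfitted model should recover coefficients near 0.10 for x1 and x2 with somewhat larger standard errors than the correct model. The underfitted model's coefficient on x1 should drift toward roughly 0.10 + 0.8 × 0.10 = 0.18, and the misfitted model should assign the noncausal x3 a nonzero coefficient with a large t-value, which mirrors the misplaced confidence the abstract warns about.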