Goldstein I F, Fleiss J L, Goldstein M, Landovitz L
Environ Health Perspect. 1979 Oct;32:311-5. doi: 10.1289/ehp.7932311.
In epidemiological studies using linear regression, it is often necessary, for reasons of economy or unavailability of data, to use as the independent variable not the variable ideally demanded by the hypothesis under study but some convenient practical approximation to it. We show that if the correlation coefficient between the "practical" and "ideal" variables can be obtained, then a range of uncertainty can be derived within which the desired regression coefficient of the dependent variable on the "ideal" variable may lie. This range can be quite wide, even if the practical and ideal variables are fairly well correlated. These points are illustrated with observed regression coefficients from an air pollution epidemiological study, in which pollution measured at one station in a large metropolitan area (containing 40 aerometric stations) was used as the practical approximation to the city-wide average pollution. The uncertainties in the regression coefficients were found to exceed the regression coefficients themselves by large factors. The problem may afflict applications of linear regression in general, and it suggests caution when independent variables for regression analysis are selected on the basis of convenience rather than relevance to the hypotheses being tested.
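The kind of range described above can be sketched with a standard result, which is not necessarily the derivation used in the paper: for three variables Y (dependent), X (practical), and Z (ideal), the requirement that their correlation matrix be positive semidefinite confines corr(Y, Z) to r_YX * r_XZ ± sqrt((1 - r_YX^2)(1 - r_XZ^2)). Multiplying by sd(Y)/sd(Z) converts this into a range for the slope of Y on Z. The function names and the numbers in the example are hypothetical and are not taken from the study.

```python
import math


def correlation_bounds(r_yx, r_xz):
    """Bounds on corr(Y, Z) given corr(Y, X) and corr(X, Z).

    Follows from requiring the 3x3 correlation matrix of (Y, X, Z)
    to be positive semidefinite. Assumption: this standard bound is
    used for illustration; the paper's own derivation may differ.
    """
    center = r_yx * r_xz
    half_width = math.sqrt((1.0 - r_yx ** 2) * (1.0 - r_xz ** 2))
    return center - half_width, center + half_width


def slope_bounds(r_yx, r_xz, sd_y, sd_z):
    """Range for the simple-regression slope of Y on the ideal variable Z."""
    lo, hi = correlation_bounds(r_yx, r_xz)
    scale = sd_y / sd_z  # slope = correlation * sd(Y) / sd(Z)
    return lo * scale, hi * scale


if __name__ == "__main__":
    # Hypothetical inputs: a weak observed association with the practical
    # variable (r = 0.2) and a fairly good practical-ideal correlation (r = 0.8).
    lo, hi = slope_bounds(r_yx=0.2, r_xz=0.8, sd_y=1.0, sd_z=1.0)
    print(f"slope of Y on ideal variable lies in [{lo:.3f}, {hi:.3f}]")
```

With these made-up inputs the interval spans both signs and is several times wider than the observed standardized slope of 0.2, which matches the qualitative point of the abstract. The interval collapses to a single value only when the practical and ideal variables are perfectly correlated.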