Borkowf Craig B, Albert Paul S
Mathematical Statistician, Centers for Disease Control and Prevention (CDC), National Center for Infectious Diseases, Division of Viral and Rickettsial Diseases, Influenza Branch, Epidemiology Section, Atlanta, GA 30333, USA.
Stat Med. 2005 Feb 28;24(4):623-45. doi: 10.1002/sim.2041.
Suppose that one wishes to make inferences about the risk of a disease by the population quartile-categories of a key continuous predictor variable. When one collects data on a prospective cohort, the standard method is simply to categorize the key predictor variable by the empirical quartiles. One may then include indicator variables for these empirical quartile-categories as predictors, along with other covariates, in a generalized linear model (GLM), with the observed health status of each subject as the response. The standard GLM method, however, is relatively inefficient, because it treats all observations that fall in the same quartile-category of the predictor variable identically, regardless of whether they lie in the centre or near the boundaries of that category. Alternatively, one may include the key predictor variable itself, along with other covariates, in a generalized additive model (GAM), again with the observed health status of each subject as the response. The GAM method non-parametrically estimates the functional relationship between the key predictor variable and the response. One may then compute statistics of interest, such as proportions and odds ratios, from the fitted GAM equation using the empirical quartile-categories. Simulations show that both the GLM and GAM methods are nearly unbiased, but the GAM method produces smaller variances and narrower bootstrap confidence intervals. An example from nutritional epidemiology illustrates the use of these methods.
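The contrast between the two approaches can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions: the cohort is simulated, there are no other covariates, and a Gaussian kernel smoother (Nadaraya-Watson) stands in for a fitted GAM. With quartile indicators as the only predictors, the fitted probabilities of a logistic GLM equal the observed proportions in each category, so the indicator approach reduces here to per-category proportions. All function names, the bandwidth, and the simulation parameters are hypothetical.

```python
import math
import random
import statistics


def quartile_labels(x):
    """Assign each value to an empirical quartile-category 0..3."""
    q1, q2, q3 = statistics.quantiles(x, n=4)  # three empirical cut points
    return [(v > q1) + (v > q2) + (v > q3) for v in x]


def category_means(values, labels):
    """Mean of `values` within each of the four quartile-categories."""
    return [
        statistics.fmean(v for v, g in zip(values, labels) if g == k)
        for k in range(4)
    ]


def odds_ratios(p):
    """Odds ratio of each category relative to the lowest category."""
    odds = [pk / (1 - pk) for pk in p]
    return [o / odds[0] for o in odds]


def kernel_smooth(x, y, bandwidth):
    """Nadaraya-Watson estimate of P(y = 1 | x) at each observed x.

    Used only as a simple stand-in for a non-parametric GAM fit.
    """
    fitted = []
    for xi in x:
        w = [math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) for xj in x]
        fitted.append(sum(wj * yj for wj, yj in zip(w, y)) / sum(w))
    return fitted


# Hypothetical simulated cohort: risk rises smoothly with the predictor.
rng = random.Random(0)
n = 400
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1 if rng.random() < 1 / (1 + math.exp(1 - 0.8 * xi)) else 0 for xi in x]

labels = quartile_labels(x)

# Indicator approach: observed proportion of cases per quartile-category.
p_indicator = category_means(y, labels)

# Smoother approach: average the smoothed risk within each category.
p_smooth = category_means(kernel_smooth(x, y, bandwidth=0.5), labels)

print("indicator proportions:", [round(p, 3) for p in p_indicator])
print("smoothed proportions: ", [round(p, 3) for p in p_smooth])
print("indicator odds ratios:", [round(r, 2) for r in odds_ratios(p_indicator)])
print("smoothed odds ratios: ", [round(r, 2) for r in odds_ratios(p_smooth)])
```

The smoother estimates the risk at each observation from nearby observations on both sides of a quartile boundary, which is the intuition behind the efficiency gain described in the abstract. A single simulated data set cannot demonstrate the reported variance reduction; that would require repeating the simulation and bootstrapping each estimate.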