Greenland S
Division of Epidemiology, University of California, School of Public Health, Los Angeles 90024.
Am J Public Health. 1989 Mar;79(3):340-9. doi: 10.2105/ajph.79.3.340.
This paper provides an overview of problems in multivariate modeling of epidemiologic data, and examines some proposed solutions. Special attention is given to the task of model selection, which involves selection of the model form, selection of the variables to enter the model, and selection of the form of these variables in the model. Several conclusions are drawn, among them: a) model and variable forms should be selected based on regression diagnostic procedures, in addition to goodness-of-fit tests; b) variable-selection algorithms in current packaged programs, such as conventional stepwise regression, can easily lead to invalid estimates and tests of effect; and c) variable selection is better approached by direct estimation of the degree of confounding produced by each variable than by significance-testing algorithms. As a general rule, before using a model to estimate effects, one should evaluate the assumptions implied by the model against both the data and prior information.
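Conclusion (c) can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes one common way of estimating the degree of confounding directly, namely comparing the crude odds ratio with a Mantel-Haenszel odds ratio adjusted for a single stratifying covariate. The function names, the 2x2 counts, and the 10% change-in-estimate cutoff are all hypothetical choices made for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from one 2x2 table.

    a, b = exposed cases, exposed noncases
    c, d = unexposed cases, unexposed noncases
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def degree_of_confounding(strata):
    """Return (crude OR, adjusted OR, relative change of crude vs. adjusted)."""
    # Collapsing over strata gives the table that ignores the covariate.
    a, b, c, d = (sum(col) for col in zip(*strata))
    crude = odds_ratio(a, b, c, d)
    adjusted = mantel_haenszel_or(strata)
    return crude, adjusted, abs(crude - adjusted) / adjusted


# Hypothetical counts: one 2x2 table per level of a candidate confounder.
strata = [
    (10, 90, 20, 180),  # covariate absent
    (60, 40, 30, 20),   # covariate present
]

crude, adjusted, change = degree_of_confounding(strata)

# Assumed cutoff (not from the paper): keep the covariate if adjusting
# for it moves the estimate by more than 10%.
keep_covariate = change > 0.10

print(f"crude OR = {crude:.3f}, adjusted OR = {adjusted:.3f}, "
      f"relative change = {change:.1%}, keep = {keep_covariate}")
```

In these made-up data the exposure has no association with the outcome within either stratum, yet the crude odds ratio is well above 1. The decision to retain the covariate rests on how far the effect estimate moves, not on a significance test of the covariate's coefficient.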