Wang Rui, Lagakos Stephen W
Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
Can J Stat. 2009 Dec 1;37(4):625-644. doi: 10.1002/cjs.10039.
When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard ("naive") methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study.
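The distortion that arises when naive inference is applied to the data used for selection can be illustrated with a small simulation. The sketch below is not the authors' proposed method; it only contrasts the naive approach with sample splitting under assumptions chosen for illustration: all covariates are independent of the response (a global null), the "variable selector" simply picks the covariate with the largest absolute sample correlation, and the marginal association is tested with a Fisher z approximation. The function names, sample size, number of covariates, and replicate count are all hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0 (Fisher z)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def select_top(columns, y):
    """Toy variable selector: index of the covariate most correlated with y."""
    return max(range(len(columns)), key=lambda j: abs(pearson_r(columns[j], y)))


def simulate(n_reps=400, n=100, p=10, alpha=0.05, seed=0):
    """Return (naive, split) rejection rates under a global null."""
    rng = random.Random(seed)
    half = n // 2
    naive_rej = split_rej = 0
    for _ in range(n_reps):
        # Response and covariates are generated independently, so every
        # rejection is a false positive.
        y = [rng.gauss(0, 1) for _ in range(n)]
        X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

        # Naive: select and test on the same n observations.
        j = select_top(X, y)
        if fisher_p_value(pearson_r(X[j], y), n) < alpha:
            naive_rej += 1

        # Sample splitting: select on the first half, test on the second.
        j_split = select_top([col[:half] for col in X], y[:half])
        r_split = pearson_r(X[j_split][half:], y[half:])
        if fisher_p_value(r_split, n - half) < alpha:
            split_rej += 1
    return naive_rej / n_reps, split_rej / n_reps


if __name__ == "__main__":
    naive_rate, split_rate = simulate()
    print(f"naive rejection rate: {naive_rate:.3f}")
    print(f"split rejection rate: {split_rate:.3f}")
```

In this setup the naive test rejects far more often than the nominal 5% level, because the selected covariate is by construction the one that looks most associated with the response. The sample-splitting test stays near the nominal level, but it uses only half of the observations for inference, which is the kind of cost that motivates methods based on the full data set.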