A permutation test for inference in logistic regression with small- and moderate-sized data sets.

Author information

Potter Douglas M

Affiliation

Biostatistics Department, Graduate School of Public Health, and Biostatistics Facility, University of Pittsburgh Cancer Institute, University of Pittsburgh, Suite 325, Sterling Plaza, 201 North Craig Street, Pittsburgh, PA 15213, USA.

Publication information

Stat Med. 2005 Mar 15;24(5):693-708. doi: 10.1002/sim.1931.

DOI:10.1002/sim.1931

PMID:15515134

Abstract

Inference based on large sample results can be highly inaccurate if applied to logistic regression with small data sets. Furthermore, maximum likelihood estimates for the regression parameters will on occasion not exist, and large sample results will be invalid. Exact conditional logistic regression is an alternative that can be used whether or not maximum likelihood estimates exist, but can be overly conservative. This approach also requires grouping the values of continuous variables corresponding to nuisance parameters, and inference can depend on how this is done. A simple permutation test of the hypothesis that a regression parameter is zero can overcome these limitations. The variable of interest is replaced by the residuals from a linear regression of it on all other independent variables. Logistic regressions are then done for permutations of these residuals, and a p-value is computed by comparing the resulting likelihood ratio statistics to the original observed value. Simulations of binary outcome data with two independent variables that have binary or lognormal distributions yield the following results: (a) in small data sets consisting of 20 observations, type I error is well-controlled by the permutation test, but poorly controlled by the asymptotic likelihood ratio test; (b) in large data sets consisting of 1000 observations, performance of the permutation test appears equivalent to that of the asymptotic test; and (c) in small data sets, the p-value for the permutation test is usually similar to the mid-p-value for exact conditional logistic regression.
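
The procedure described in the abstract can be sketched in a few dozen lines. What follows is an illustrative sketch under stated assumptions, not the author's implementation. It uses only the Python standard library, samples random permutations rather than enumerating all of them, and uses the common (hits + 1) / (n_perm + 1) p-value convention; the abstract specifies none of these. All function and parameter names (`solve`, `ols_residuals`, `max_loglik`, `permutation_test`, `n_perm`, `seed`) are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ZeroDivisionError("singular matrix")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_residuals(target, design):
    """Residuals from a least-squares regression of target on the design rows."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(p)]
    beta = solve(xtx, xty)
    return [t - sum(b * v for b, v in zip(beta, r)) for r, t in zip(design, target)]


def log_likelihood(y, design, beta):
    total = 0.0
    for yi, row in zip(y, design):
        eta = sum(b * v for b, v in zip(beta, row))
        # log(1 + exp(eta)) evaluated without overflow
        total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total


def max_loglik(y, design, iters=25, tol=1e-10):
    """Maximized logistic log-likelihood: Newton-Raphson with step halving."""
    p = len(design[0])
    beta = [0.0] * p
    best = log_likelihood(y, design, beta)
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for yi, row in zip(y, design):
            eta = sum(b * v for b, v in zip(beta, row))
            mu = 0.5 * (1.0 + math.tanh(0.5 * eta))  # stable sigmoid
            w = mu * (1.0 - mu)
            for i in range(p):
                grad[i] += (yi - mu) * row[i]
                for j in range(p):
                    hess[i][j] += w * row[i] * row[j]
        try:
            step = solve(hess, grad)
        except ZeroDivisionError:
            break  # likelihood is flat, e.g. when the MLE does not exist
        scale, improved = 1.0, False
        while scale > 1e-4:
            trial = [b + scale * s for b, s in zip(beta, step)]
            value = log_likelihood(y, design, trial)
            if value > best + tol:
                beta, best, improved = trial, value, True
                break
            scale /= 2.0
        if not improved:
            break
    return best


def permutation_test(y, x, others, n_perm=999, seed=0):
    """Test that the coefficient of x is zero, adjusting for `others`."""
    rng = random.Random(seed)
    base = [[1.0] + list(row) for row in others]   # intercept + nuisance covariates
    resid = ols_residuals(x, base)                 # x with the other covariates removed
    ll_null = max_loglik(y, base)

    def lr_stat(r):
        full = [row + [ri] for row, ri in zip(base, r)]
        return 2.0 * (max_loglik(y, full) - ll_null)

    observed = lr_stat(resid)
    hits = 0
    for _ in range(n_perm):
        perm = resid[:]
        rng.shuffle(perm)
        if lr_stat(perm) >= observed - 1e-9:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A call might look like `stat, p = permutation_test(y, x, [[z_i] for z_i in z])`, where `y` is a list of 0/1 outcomes, `x` is the variable of interest, and `z` is a single nuisance covariate. Because the iteration cap and step halving stop the fit gracefully when the likelihood has no finite maximizer, the likelihood ratio statistic is still returned (as an approximation to its supremum) for permutations in which maximum likelihood estimates do not exist.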
