Separation in Logistic Regression: Causes, Consequences, and Control.

Author information

Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran.

Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Austria.

Publication information

Am J Epidemiol. 2018 Apr 1;187(4):864-870. doi: 10.1093/aje/kwx299.

DOI:10.1093/aje/kwx299

PMID:29020135

Abstract

Separation is encountered in regression models with a discrete outcome (such as logistic regression) where the covariates perfectly predict the outcome. It is most frequent under the same conditions that lead to small-sample and sparse-data bias, such as presence of a rare outcome, rare exposures, highly correlated covariates, or covariates with strong effects. In theory, separation will produce infinite estimates for some coefficients. In practice, however, separation may be unnoticed or mishandled because of software limits in recognizing and handling the problem and in notifying the user. We discuss causes of separation in logistic regression and describe how common software packages deal with it. We then describe methods that remove separation, focusing on the same penalized-likelihood techniques used to address more general sparse-data problems. These methods improve accuracy, avoid software problems, and allow interpretation as Bayesian analyses with weakly informative priors. We discuss likelihood penalties, including some that can be implemented easily with any software package, and their relative advantages and disadvantages. We provide an illustration of ideas and methods using data from a case-control study of contraceptive practices and urinary tract infection.
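The behavior described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical 2x2 data set (not the case-control data from the study) in which every exposed subject is a case, so the exposure coefficient has no finite maximum-likelihood estimate. The penalty shown is a simple ridge penalty, which corresponds to a normal prior on the exposure coefficient; it is chosen here only because it is easy to write down, and it is not necessarily one of the penalties the authors recommend. The function name `fit_logistic` and the parameter `ridge` are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, ridge=0.0, n_iter=20):
    """Newton-Raphson for logistic regression.

    Optionally subtracts (ridge / 2) * beta_j**2 from the log-likelihood
    for every coefficient except the intercept (column 0 of X).
    """
    beta = np.zeros(X.shape[1])
    pen = np.full(X.shape[1], float(ridge))
    pen[0] = 0.0  # the intercept is left unpenalized
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))        # fitted probabilities
        grad = X.T @ (y - mu) - pen * beta            # penalized score
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None]) + np.diag(pen)
        beta = beta + np.linalg.solve(hess, grad)     # Newton update
    return beta

# Hypothetical data: 5 exposed subjects, all cases (a zero cell),
# and 20 unexposed subjects, of whom 10 are cases.
exposure = np.concatenate([np.ones(5), np.zeros(20)])
y = np.concatenate([np.ones(5), np.ones(10), np.zeros(10)])
X = np.column_stack([np.ones_like(exposure), exposure])

# Without a penalty the exposure coefficient keeps growing with iterations.
b10 = fit_logistic(X, y, n_iter=10)
b20 = fit_logistic(X, y, n_iter=20)

# With a ridge penalty (assumed value 0.5) the estimate settles down.
bp_short = fit_logistic(X, y, ridge=0.5, n_iter=10)
bp_long = fit_logistic(X, y, ridge=0.5, n_iter=50)

print("unpenalized, 10 vs 20 iterations:", b10[1], b20[1])
print("penalized,   10 vs 50 iterations:", bp_short[1], bp_long[1])
```

In the unpenalized fit the iteration count is kept small on purpose: with more iterations the fitted probabilities for the exposed group round to exactly 1 and the Hessian becomes singular, which is one way software can fail or stop without a clear warning when separation is present.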
