Effects of categorization method, regression type, and variable distribution on the inflation of Type-I error rate when categorizing a confounding variable.

Author information

Barnwell-Ménard Jean-Louis, Li Qing, Cohen Alan A

Affiliation

Department of Economics, University of Sherbrooke, Sherbrooke, QC, Canada.

Publication information

Stat Med. 2015 Mar 15;34(6):936-49. doi: 10.1002/sim.6387. Epub 2014 Dec 11.

DOI:10.1002/sim.6387

PMID:25504513

Abstract

The loss of signal associated with categorizing a continuous variable is well known, and previous studies have demonstrated that this can lead to an inflation of Type-I error when the categorized variable is a confounder in a regression analysis estimating the effect of an exposure on an outcome. However, it is not known how the Type-I error may vary under different circumstances, including logistic versus linear regression, different distributions of the confounder, and different categorization methods. Here, we analytically quantified the effect of categorization and then performed a series of 9600 Monte Carlo simulations to estimate the Type-I error inflation associated with categorization of a confounder under different regression scenarios. We show that Type-I error is unacceptably high (>10% in most scenarios and often 100%). The only exception was when the variable categorized was a continuous mixture proxy for a genuinely dichotomous latent variable, where both the continuous proxy and the categorized variable are error-ridden proxies for the dichotomous latent variable. As expected, error inflation was also higher with larger sample size, fewer categories, and stronger associations between the confounder and the exposure or outcome. We provide online tools that can help researchers estimate the potential error inflation and understand how serious a problem this is.
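
The mechanism described in the abstract can be illustrated with a minimal Monte Carlo sketch. This is not the authors' simulation code or their online tool. The function name `type1_error_rate`, the sample size, the effect sizes `a` and `b`, the median split, the normal confounder, and the use of the 1.96 normal critical value are all assumptions chosen for illustration. The exposure has no true effect on the outcome, so every rejection of the null is a Type-I error.

```python
import numpy as np


def type1_error_rate(n=500, n_sims=500, a=0.5, b=0.5, categorize=True, seed=0):
    """Estimate how often a null exposure effect is declared significant.

    Hypothetical setup (not taken from the paper):
      confounder c ~ N(0, 1)
      exposure   x = a * c + noise
      outcome    y = b * c + noise   (x has NO effect on y)
    The linear model y ~ x + confounder is fit by ordinary least squares,
    adjusting either for c itself or for a median split of c.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        c = rng.normal(size=n)
        x = a * c + rng.normal(size=n)
        y = b * c + rng.normal(size=n)

        # Dichotomize the confounder at its sample median, or keep it continuous.
        c_adj = (c > np.median(c)).astype(float) if categorize else c

        # Design matrix: intercept, exposure, (possibly categorized) confounder.
        design = np.column_stack([np.ones(n), x, c_adj])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)

        # Classical OLS standard error for the exposure coefficient.
        resid = y - design @ beta
        sigma2 = resid @ resid / (n - design.shape[1])
        cov = sigma2 * np.linalg.inv(design.T @ design)
        t_x = beta[1] / np.sqrt(cov[1, 1])

        # Two-sided test at alpha = 0.05 using the normal approximation.
        rejections += int(abs(t_x) > 1.96)
    return rejections / n_sims


if __name__ == "__main__":
    print("categorized confounder:", type1_error_rate(categorize=True))
    print("continuous confounder: ", type1_error_rate(categorize=False))
```

Under these assumed settings, adjusting for the median-split confounder should yield a rejection rate well above the nominal 5%, while adjusting for the continuous confounder should stay close to it. Increasing `n`, `a`, or `b` in this sketch lets a reader explore the abstract's statement that inflation grows with sample size and with the strength of the confounder's associations.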
