From the Department of Environmental Health and Engineering, Bloomberg School of Public Health, Johns Hopkins, Baltimore, MD.
Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC.
Epidemiology. 2022 Nov 1;33(6):843-853. doi: 10.1097/EDE.0000000000001534. Epub 2022 Oct 5.
Epidemiologic studies often quantify exposure using biomarkers, which commonly have skewed distributions. Although a normality assumption is not required when the biomarker is used as an independent variable in linear regression, it has become common practice to log-transform biomarker concentrations. This transformation can be motivated by concerns about a nonlinear dose-response relationship or outliers; however, such a transformation may not always reduce bias. In this study, we used simulations to evaluate the validity of the motivations underlying the decision to log-transform an independent variable, considering eight scenarios that can give rise to skewed X and normal Y. Our simulation study demonstrates that (1) if the skewness of the exposure did not arise from a biasing factor (e.g., measurement error), the analytic approach with the best overall model fit best reflected the underlying outcome-generating methods and was least biased, regardless of the skewness of X, and (2) all estimates were biased if the skewness of the exposure was a consequence of a biasing factor. We additionally illustrate, using NHANES, a process for determining whether transformation of an independent variable is needed. Our study, and our suggestion to divorce the shape of the exposure distribution from the decision to log-transform it, may aid researchers in planning analyses that use biomarkers or other skewed independent variables.
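The model-fit comparison described in finding (1) can be illustrated with a minimal simulation sketch. This is not the authors' code and does not reproduce their eight scenarios: the lognormal exposure, the effect size `beta`, the sample size, the use of AIC as the fit criterion, and the function names `fit_ols`, `aic`, and `compare_fits` are all assumptions chosen for illustration. The sketch generates a right-skewed X with no biasing factor, generates Y from either X or log(X), and fits both candidate models to see which has the better fit and the less biased slope.

```python
import numpy as np


def fit_ols(x, y):
    """Fit y = a + b*x by least squares; return (slope, residual sum of squares)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(coef[1]), float(resid @ resid)


def aic(rss, n, k=3):
    """Gaussian AIC up to an additive constant.

    k counts the intercept, the slope, and the error variance.
    """
    return n * np.log(rss / n) + 2 * k


def compare_fits(truth, n=5000, beta=0.5, seed=0):
    """Simulate a skewed exposure and compare raw-X and log-X models.

    truth: "linear" if Y depends on X, "log" if Y depends on log(X).
    All numeric settings here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # right-skewed exposure
    signal = x if truth == "linear" else np.log(x)
    y = 1.0 + beta * signal + rng.normal(0.0, 1.0, size=n)

    results = {}
    for name, predictor in (("raw", x), ("log", np.log(x))):
        slope, rss = fit_ols(predictor, y)
        results[name] = {"slope": slope, "aic": aic(rss, n)}
    return results


if __name__ == "__main__":
    for truth in ("linear", "log"):
        print(truth, compare_fits(truth))
```

Under these assumptions, the model that matches the data-generating form should show the lower AIC and a slope close to `beta`, even though X is equally skewed in both runs. The sketch covers only the case without a biasing factor; it says nothing about finding (2).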