Division of Environmental Epidemiology, Institute for Risk Assessment Sciences, Utrecht University, Utrecht, The Netherlands.
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands.
Occup Environ Med. 2018 Jul;75(7):522-529. doi: 10.1136/oemed-2016-104231. Epub 2017 Sep 25.
There is growing recognition that simultaneously assessing multiple exposures may reduce false positive discoveries and improve epidemiological effect estimates. We evaluated the performance of statistical methods for identifying exposure-outcome associations across various data structures typical of environmental and occupational epidemiology analyses.
We simulated a case-control study, generating 100 data sets for each of 270 different simulation scenarios that varied the number of exposure variables, the correlation between exposures, the sample size, the number of effective exposures and the magnitude of effect estimates. We compared conventional analytical approaches, that is, univariable (with and without multiplicity adjustment), multivariable and stepwise logistic regression, with variable selection methods: sparse partial least squares discriminant analysis, boosting, and frequentist and Bayesian penalised regression approaches.
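As a rough illustration of the kind of data-generating step described above, the following minimal sketch draws one data set with correlated exposures and a binary outcome. It is not the authors' simulation code. The function name `simulate_dataset`, its parameters, the exchangeable correlation structure (one shared latent factor) and the standard normal exposures are all assumptions made for illustration. For simplicity the outcome is drawn directly from a logistic model rather than through case-control sampling.

```python
import math
import random


def simulate_dataset(n, p, rho, effective, log_or, intercept=0.0, seed=0):
    """Draw n subjects with p correlated exposures and a binary outcome.

    Hypothetical sketch: exposures share one latent factor, so every pair
    has correlation rho (0 <= rho <= 1). Exposures whose index is in
    `effective` have log odds ratio `log_or`; all others have no effect.
    """
    rng = random.Random(seed)
    betas = [log_or if j in effective else 0.0 for j in range(p)]
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # latent factor inducing the correlation
        row = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ]
        linear = intercept + sum(b * x for b, x in zip(betas, row))
        prob = 1.0 / (1.0 + math.exp(-linear))  # logistic outcome model
        y.append(1 if rng.random() < prob else 0)
        X.append(row)
    return X, y, betas


# One scenario: 200 subjects, 5 exposures, pairwise correlation 0.5,
# two effective exposures with an odds ratio of 1.5 (all values illustrative).
X, y, betas = simulate_dataset(200, 5, 0.5, {0, 1}, math.log(1.5), seed=1)
```

Repeating such a draw with different seeds, and across a grid of values for `n`, `p`, `rho`, `effective` and `log_or`, would mirror the scenario-by-replicate layout of the study design.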
The variable selection methods consistently yielded more precise effect estimates and generally improved selection accuracy compared with conventional logistic regression methods, especially for scenarios with higher correlation levels. Penalised lasso and elastic net regression both seemed to perform particularly well, specifically when statistical inference based on a balanced weighting of high sensitivity and a low proportion of false discoveries is sought.
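The two selection criteria mentioned here, sensitivity and the proportion of false discoveries, can be computed for a single fitted model as in the sketch below. The abstract does not give the exact definitions used in the paper, so the function `selection_metrics` and its conventions (for example, returning a false discovery proportion of 0 when nothing is selected) are assumptions.

```python
def selection_metrics(selected, truly_effective):
    """Return (sensitivity, false discovery proportion) for one selection.

    `selected` holds the indices of exposures a method flagged;
    `truly_effective` holds the indices with a real effect in the simulation.
    """
    selected = set(selected)
    truth = set(truly_effective)
    true_positives = len(selected & truth)
    # Sensitivity: share of truly effective exposures that were selected.
    sensitivity = true_positives / len(truth) if truth else float("nan")
    # False discovery proportion: share of selections that are not effective.
    # Assumed convention: 0.0 when no exposure is selected.
    fdp = (len(selected) - true_positives) / len(selected) if selected else 0.0
    return sensitivity, fdp


# A method selects exposures 0, 1 and 4; exposures 0, 1 and 2 are effective.
sens, fdp = selection_metrics({0, 1, 4}, {0, 1, 2})
```

In this example two of the three effective exposures are recovered and one of the three selections is false, so both quantities can be averaged over the replicates of a scenario to compare methods.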
In this extensive simulation study with multicollinear data, we found that most variable selection methods consistently outperformed conventional approaches, and demonstrated how performance is influenced by the structure of the data and underlying model.