Leeds Institute for Data Analytics, University of Leeds, Leeds, UK.
School of Medicine, University of Leeds, Leeds, UK.
Int J Epidemiol. 2021 Jan 23;49(6):2074-2082. doi: 10.1093/ije/dyaa049.
Prediction and causal explanation are fundamentally distinct tasks of data analysis. In health applications, this difference can be understood in terms of the difference between prognosis (prediction) and prevention/treatment (causal explanation). Nevertheless, these two concepts are often conflated in practice. We use the framework of generalized linear models (GLMs) to illustrate that predictive and causal queries require distinct processes for their application and subsequent interpretation of results. In particular, we identify five primary ways in which GLMs for prediction differ from GLMs for causal inference: (i) the covariates that should be considered for inclusion in (and possibly exclusion from) the model; (ii) how a suitable set of covariates to include in the model is determined; (iii) which covariates are ultimately selected and what functional form (i.e. parameterization) they take; (iv) how the model is evaluated; and (v) how the model is interpreted. We outline some of the potential consequences of failing to acknowledge and respect these differences, and additionally consider the implications for machine learning (ML) methods. We then conclude with three recommendations that we hope will help ensure that both prediction and causal modelling are used appropriately and to greatest effect in health research.
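As an illustration of difference (i), the sketch below simulates a setting in which a covariate improves prediction but should be excluded when the aim is to estimate a total causal effect. It is not taken from the paper: the data-generating model, the coefficient values, and the helper name `fit_ols` are all assumptions made for this example, and ordinary least squares stands in for a GLM with an identity link. Here `m` is a mediator on the pathway from the exposure `x` to the outcome `y`, so the total effect of `x` on `y` is 0.5 + 0.8 × 1.0 = 1.3 by construction.

```python
import numpy as np


def fit_ols(features, outcome):
    """Fit a linear model (a GLM with identity link) by least squares.

    Returns the coefficients (intercept first) and the in-sample
    mean squared error. Hypothetical helper for this illustration.
    """
    design = np.column_stack([np.ones(len(outcome)), features])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    mse = float(np.mean((outcome - design @ coef) ** 2))
    return coef, mse


# Assumed data-generating process: x -> m -> y, plus a direct x -> y path.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                      # exposure
m = 0.8 * x + rng.normal(size=n)            # mediator, caused by x
y = 0.5 * x + 1.0 * m + rng.normal(size=n)  # outcome

# Causal model for the total effect of x: the mediator is left out.
coef_total, mse_total = fit_ols(x, y)

# Prediction model: the mediator is included because it improves fit.
coef_adj, mse_adj = fit_ols(np.column_stack([x, m]), y)

print("x coefficient without mediator:", coef_total[1])
print("x coefficient with mediator:   ", coef_adj[1])
print("in-sample MSE without / with mediator:", mse_total, mse_adj)
```

Under these assumptions, the model that includes `m` fits the outcome better, yet its coefficient for `x` estimates only the direct effect (0.5 in this simulation) rather than the total effect (1.3). The covariate set that serves prediction is therefore not the one that answers the causal question.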