Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada.
Center for Addictions and Mental Health, Toronto, ON M6J 1H4, Canada.
Int J Environ Res Public Health. 2023 Jun 21;20(13):6194. doi: 10.3390/ijerph20136194.
Descriptive epidemiology, where the goal is to describe and identify the most important associations with an outcome given a large set of potential predictors, lacks rigorous methodological development. This has often led to the Table 2 fallacy, in which the coefficient estimates for all covariates from a single multivariable regression model are presented, even though they are often uninterpretable in a descriptive analysis. We argue that machine learning (ML) is a potential solution to this problem. We illustrate the power of ML with an example analysis identifying the most important predictors of alcohol abuse among sexual minority youth. The framework we propose for this analysis is as follows: (1) identify a few ML methods for the analysis, (2) optimize the parameters using the whole data with a nested cross-validation approach, (3) rank the variables using variable importance scores, (4) present partial dependence plots (PDPs) to illustrate the association between the important variables and the outcome, and (5) identify the strength of the interaction terms using the PDPs. We discuss the potential strengths and weaknesses of using ML methods for descriptive analysis and future directions for research. R code to reproduce these analyses is provided, and we invite other researchers to use it.
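Steps (3) and (4) of the framework can be illustrated with a minimal, model-agnostic sketch. This is not the authors' R code: it is written in Python with only the standard library, it uses permutation importance as one possible variable importance score, and the data are synthetic. The names `predict`, `x1`, `x2`, and `noise` are hypothetical stand-ins for a fitted ML model and its predictors.

```python
import random
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        curve.append(mean(predict(r) for r in modified))
    return curve


def permutation_importance(predict, rows, outcomes, feature, n_repeats=20, seed=0):
    """Mean increase in squared error after shuffling one feature's column."""
    rng = random.Random(seed)

    def mse(data):
        return mean((predict(r) - y) ** 2 for r, y in zip(data, outcomes))

    baseline = mse(rows)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, feature: v} for row, v in zip(rows, shuffled)]
        increases.append(mse(permuted) - baseline)
    return mean(increases)


# Synthetic data (assumption): the outcome depends on x1 and x2, not on noise.
rng = random.Random(42)
rows = [
    {"x1": rng.random(), "x2": rng.random(), "noise": rng.random()}
    for _ in range(200)
]
outcomes = [2 * r["x1"] + r["x2"] + rng.gauss(0, 0.1) for r in rows]


def predict(row):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return 2 * row["x1"] + row["x2"]


# Step (3): rank variables by importance score, largest first.
scores = {f: permutation_importance(predict, rows, outcomes, f) for f in rows[0]}
ranking = sorted(scores, key=scores.get, reverse=True)

# Step (4): partial dependence of the prediction on the top-ranked variable.
grid = [0.0, 0.5, 1.0]
pd_curve = partial_dependence(predict, rows, ranking[0], grid)

print("ranking:", ranking)
print("partial dependence on", ranking[0], ":", pd_curve)
```

Because both functions only call `predict`, the same code applies to any fitted model. Plotting `pd_curve` against `grid` gives the partial dependence plot for that variable.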