Division of Biostatistics, School of Public Health, University of California at Berkeley, Berkeley, California, United States.
Office of Biostatistics, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, Maryland, United States.
Int J Epidemiol. 2023 Aug 2;52(4):1276-1285. doi: 10.1093/ije/dyad023.
Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.
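To make the idea of considering many learners concrete, the following is a minimal sketch of the simplest variant, often called a discrete super learner: each candidate learner is scored by cross-validated risk and the one with the lowest risk is refit on all the data. It is an illustration under stated assumptions, not the specification recommended in the article. The two learners, the squared-error loss, the five folds, the simulated data and every function name are hypothetical choices made for this example. A full SL would typically go one step further and combine the learners' cross-validated predictions with a metalearner instead of selecting a single learner.

```python
import random
from statistics import mean


def mean_learner(xs, ys):
    """Hypothetical learner 1: ignore the covariate, predict the training mean."""
    m = mean(ys)
    return lambda x: m


def linear_learner(xs, ys):
    """Hypothetical learner 2: least-squares line through the (x, y) pairs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cv_risk(learner, xs, ys, n_folds=5):
    """Cross-validated mean squared error of one learner.

    Folds are interleaved (every n_folds-th observation), which assumes the
    rows are in no meaningful order.
    """
    n = len(xs)
    losses = []
    for k in range(n_folds):
        valid = set(range(k, n, n_folds))
        train = [i for i in range(n) if i not in valid]
        # Fit on the training folds only, then score on the held-out fold.
        fit = learner([xs[i] for i in train], [ys[i] for i in train])
        losses.extend((ys[i] - fit(xs[i])) ** 2 for i in valid)
    return mean(losses)


def discrete_super_learner(library, xs, ys, n_folds=5):
    """Pick the learner with the lowest cross-validated risk and refit it on all data."""
    risks = {name: cv_risk(learner, xs, ys, n_folds) for name, learner in library.items()}
    best = min(risks, key=risks.get)
    return best, library[best](xs, ys), risks


# Simulated data (an assumption for illustration): y depends linearly on x plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

library = {"mean": mean_learner, "linear": linear_learner}
best_name, sl_fit, cv_risks = discrete_super_learner(library, xs, ys)
print(best_name, cv_risks)
```

The analyst's choices discussed in the article correspond to the pieces that are hard-coded here: which learners enter `library`, which loss defines the risk, and how many cross-validation folds are used.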