Pacca Lucia, Dang Kristina V, Koenig Leah, dP Duarte Catherine, Gaye S Amina, Harrati Amal, Vable Anusha M
Department of Family and Community Medicine, University of California San Francisco, 2540 23rd St, San Francisco, CA 94110, United States.
University of Southern California, Los Angeles, California 90007.
Am J Epidemiol. 2025 Apr 10. doi: 10.1093/aje/kwaf065.
Characterizing longitudinal trajectories of variables that unfold over time (e.g., social, health, or environmental variables) is a persistent challenge, but it can be accomplished with sequence and cluster analysis, data-driven approaches that can differentiate the timing, order, and duration of events. We present practical guidance on implementing sequence and cluster analysis for epidemiologists, with the goal of providing clear advice on decision points and tradeoffs. We introduce the three main steps of sequence and cluster analysis: (1) coding trajectories of ordered events (data cleaning); (2) measuring dissimilarity between trajectories (sequence analysis); and (3) grouping similar trajectories (cluster analysis). Each of these steps presents researchers with several decision points, such as data cleaning rules, options for evaluating sequence dissimilarity, and choices of clustering algorithms. After outlining each step, we provide an applied example in which we create and group transition-to-retirement trajectories from ages 51 to 75 for a sample of 9,189 Health and Retirement Study participants using self-reported employment information, then estimate the association between transition-to-retirement groups and self-rated health. We seek to provide epidemiologists with an initial guide to the analytic decisions and implementation challenges of sequence analysis as this approach is increasingly adopted and undergoes methodological advances.
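The three steps named in the abstract can be sketched on toy data as follows. This is a minimal illustration, not the authors' implementation: the six trajectories, the state codes (F = full-time work, P = part-time work, R = retired), the optimal matching costs (insertion/deletion = 1, substitution = 2), the average-linkage algorithm, and the choice of three clusters are all assumptions made for the example. The abstract does not say which dissimilarity measure or clustering algorithm the paper uses.

```python
from itertools import combinations

# Step 1 (coding trajectories): hypothetical toy data, one state per survey wave.
# F = full-time work, P = part-time work, R = retired (illustrative codes only).
trajectories = {
    "a": "FFFFRRRR",
    "b": "FFFFFRRR",
    "c": "FFPPPPRR",
    "d": "FPPPPPRR",
    "e": "RRRRRRRR",
    "f": "FRRRRRRR",
}


def optimal_matching(s, t, indel=1.0, sub=2.0):
    """Step 2: edit distance between two state sequences.

    `indel` and `sub` are assumed costs; real analyses must choose them.
    """
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,                         # delete a from s
                curr[j - 1] + indel,                     # insert b into s
                prev[j - 1] + (0.0 if a == b else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Pairwise dissimilarities, stored symmetrically in a dict."""
    ids = list(seqs)
    d = {(i, i): 0.0 for i in ids}
    for i, j in combinations(ids, 2):
        d[i, j] = d[j, i] = optimal_matching(seqs[i], seqs[j])
    return d


def average_linkage(ids, d, k):
    """Step 3: agglomerative clustering until k groups remain."""
    clusters = [[i] for i in ids]

    def mean_distance(pair):
        a, b = pair
        total = sum(d[x, y] for x in clusters[a] for y in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > k:
        # Merge the two clusters with the smallest mean pairwise distance.
        a, b = min(combinations(range(len(clusters)), 2), key=mean_distance)
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return [sorted(c) for c in clusters]


d = distance_matrix(trajectories)
groups = average_linkage(list(trajectories), d, k=3)
print(groups)
```

On this toy data the sketch separates the trajectories by timing and state: direct full-time-to-retirement paths, paths through part-time work, and paths retired from the first or second wave. Changing the costs or the linkage rule can change the grouping, which is the kind of decision point the abstract refers to.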