Deployment Health Research Department, Naval Health Research Center, San Diego, CA, USA.
Leidos, Inc, San Diego, CA, USA.
Sci Rep. 2024 Oct 28;14(1):25764. doi: 10.1038/s41598-024-77563-8.
The Millennium Cohort Study is a longitudinal study that collects self-reported survey data to examine the long-term effects of military service. Participant nonresponse to follow-up surveys presents a potential threat to the validity and generalizability of study findings. In recent years, predictive analytics has emerged as a promising tool to identify predictors of nonresponse. Here, we develop a high-skill classifier using machine learning techniques to predict participant response to follow-up surveys of the Millennium Cohort Study. Six supervised algorithms were employed to predict response to the 2021 follow-up survey. Using latent class analysis (LCA), we classified participants based on historical survey response and compared prediction performance with and without this variable. Feature analysis was subsequently conducted on the best-performing model. When the LCA variable was included in the machine learning analysis, all six algorithms performed comparably. Without the LCA variable, random forest outperformed the benchmark regression model; however, overall prediction performance decreased. Feature analysis identified the LCA variable as the most important predictor. Our findings highlight the importance of historical response data for improving the prediction of participant response to follow-up surveys. Machine learning algorithms can be especially valuable when historical data are not available. Implementing these methods in longitudinal studies can enhance outreach efforts by strategically targeting participants, ultimately boosting survey response rates and mitigating nonresponse.
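The with-and-without comparison described above can be illustrated with a minimal sketch on synthetic data. This is not the study's pipeline: the data are simulated, a single binary `prior` column stands in for the LCA-derived response class, `age_z` is a hypothetical weak covariate, and a hand-written logistic regression stands in for the six algorithms. All names and coefficients here are assumptions chosen for illustration only.

```python
import math
import random


def simulate(n, rng):
    """Synthetic participants as (age_z, prior, responded). Purely illustrative."""
    rows = []
    for _ in range(n):
        age_z = rng.gauss(0.0, 1.0)             # hypothetical weak predictor
        prior = 1 if rng.random() < 0.5 else 0  # stand-in for an LCA response class
        logit = -1.0 + 0.3 * age_z + 2.5 * prior  # assumed coefficients
        p = 1.0 / (1.0 + math.exp(-logit))
        rows.append((age_z, prior, 1 if rng.random() < p else 0))
    return rows


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Batch gradient descent for logistic regression; returns [bias, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def scores(w, X):
    """Linear scores (logits); their ranking is all that AUC needs."""
    return [w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) for xi in X]


def auc(s, labels):
    """Pairwise AUC: share of (responder, nonresponder) pairs ranked correctly."""
    pos = [v for v, l in zip(s, labels) if l == 1]
    neg = [v for v, l in zip(s, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def evaluate(cols, train, test):
    """Fit on the chosen feature columns and return held-out AUC."""
    X_tr = [[r[c] for c in cols] for r in train]
    X_te = [[r[c] for c in cols] for r in test]
    w = fit_logistic(X_tr, [r[2] for r in train])
    return auc(scores(w, X_te), [r[2] for r in test])


rng = random.Random(0)
data = simulate(3000, rng)
train, test = data[:2000], data[2000:]

auc_with = evaluate([0, 1], train, test)   # covariate + historical-response class
auc_without = evaluate([0], train, test)   # covariate only

print(f"AUC with historical-response feature:    {auc_with:.3f}")
print(f"AUC without historical-response feature: {auc_without:.3f}")
```

Because the simulated outcome is driven mainly by `prior`, dropping that column lowers the held-out AUC. This mirrors the qualitative pattern reported in the abstract but says nothing about the magnitudes observed in the actual cohort.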