Carnes Nate C, Kolaja Claire A, Lewis Crystal L, Castañeda Sheila F, Rull Rudolph P
Deployment Health Research Department, Naval Health Research Center, San Diego, CA, USA.
Leidos, Inc, San Diego, CA, USA.
BMC Med Res Methodol. 2025 Jul 14;25(1):174. doi: 10.1186/s12874-025-02620-3.
Missing survey data can threaten the validity and generalizability of findings from longitudinal cohort studies. Respondent characteristics and survey attributes may contribute to survey non-completion, a form of missing data in which respondents begin but do not finish a survey, and this missingness can lead to biased conclusions. The objectives of the present research are to demonstrate how machine learning can identify survey non-completion and to characterize individual and methodological factors that are associated with this form of data missingness.
The present study developed a novel machine learning algorithm to characterize survey non-completion in the Millennium Cohort Study during the 2019-2021 data collection cycle that included a 30- to 45-min paper or web-based follow-up survey for previously enrolled panels (Panels 1-4, n = 80,986) and a 30- to 45-min web-based baseline survey for new enrollees (Panel 5, n = 58,609). We then examined the effect of individual characteristics and survey attributes on survey non-completion.
This algorithm achieved 99% accuracy and showed that 0.29% of follow-up respondents and 15.43% of new enrollees were survey non-completers. Our findings suggest that certain military and sociodemographic characteristics (e.g., enlisted pay grades) were associated with increased survey non-completion in the 2019-2021 cycle. Survey attributes explained a large proportion of the variability in survey non-completion, with our analyses indicating a higher likelihood of survey non-completion in sections (1) located toward the beginning of the survey, (2) with sensitive questions, and (3) with fewer questions.
This research highlights the importance of accounting for potential respondent bias due to survey non-completion and identifies factors associated with this type of missing data.