Comparing Multiple Imputation Methods to Address Missing Patient Demographics in Immunization Information Systems: Retrospective Cohort Study.

Author information

Brown Sara, Kudia Ousswa, Kleine Kaye, Kidd Bryndan, Wines Robert, Meckes Nathanael

Affiliations

Scientific Services - Analytics, Scientific Technologies Corporation (United States), 411 S 1st St, Phoenix, AZ, 85004, United States, 1 480-745-8500.

Immunization Services, West Virginia Department of Health and Human Services, Charleston, WV, United States.

Publication information

JMIR Public Health Surveill. 2025 Aug 26;11:e73916. doi: 10.2196/73916.

Abstract

BACKGROUND

Immunization Information Systems (IIS) and surveillance data are essential for public health interventions and programming; however, missing data are often a challenge, potentially introducing bias and impacting the accuracy of vaccine coverage assessments, particularly in addressing disparities.

OBJECTIVE

This study aimed to evaluate the performance of 3 multiple imputation methods, Stata's (StataCorp LLC) multiple imputation using chained equations (MICE), scikit-learn's Iterative-Imputer, and Python's miceforest package, in managing missing race and ethnicity data in large-scale surveillance datasets. We compared the methods on preservation of demographic distributions and on computational efficiency, and we performed G-tests on contingency tables to obtain likelihood ratio statistics assessing the association between race and ethnicity and flu vaccination status.
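As an illustration of the G-test mentioned above, the following is a minimal sketch using SciPy, which is an assumption here since the abstract does not say which software produced the likelihood ratio statistics. The counts are hypothetical and are not taken from the study.

```python
import numpy as np
from scipy.stats import chi2_contingency


def g_test(table):
    """Return (G, p, dof) for a likelihood-ratio test of independence."""
    # lambda_="log-likelihood" switches the statistic from Pearson's
    # chi-square to the G statistic: G = 2 * sum(O * ln(O / E)).
    g, p, dof, _expected = chi2_contingency(
        np.asarray(table), correction=False, lambda_="log-likelihood"
    )
    return g, p, dof


# Hypothetical counts (NOT from the study):
# rows = two demographic groups, columns = vaccinated / not vaccinated.
example = [[30, 70], [50, 50]]
g, p, dof = g_test(example)
print(f"G = {g:.3f}, dof = {dof}, p = {p:.4f}")
```

With multiply imputed data, one such table could be built per imputed dataset; how the study combined statistics across its 15 datasets is not described in the abstract.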

METHODS

In this retrospective cohort study, we analyzed 2021-2022 flu vaccination and demographic data from the West Virginia Immunization Information System (N=2,302,036), in which race was missing for 15% of records and ethnicity for 34%. MICE, Iterative-Imputer, and miceforest were each used to impute the missing values, generating 15 imputed datasets per method. We assessed computational efficiency, preservation of demographic distributions, and spatial clustering patterns; clustering is reported using G-statistics.
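The idea of generating 15 imputed datasets can be sketched with scikit-learn's `IterativeImputer`. This is only an illustration under stated assumptions: the abstract does not give the study's settings, so the integer coding of categories, the rounding step, `max_iter=10`, and the synthetic data below are all hypothetical choices.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def impute_m_datasets(X, n_categories, m=15, seed=0):
    """Return m completed copies of X (integer category codes, NaN = missing)."""
    completed = []
    for i in range(m):
        # sample_posterior=True draws imputations from the predictive
        # distribution, so each seed yields a different completed dataset.
        imputer = IterativeImputer(
            sample_posterior=True, max_iter=10, random_state=seed + i
        )
        X_hat = imputer.fit_transform(X)
        # Assumption: map continuous draws back to valid category codes
        # by rounding and clipping (a crude treatment of categorical data).
        for col, k in enumerate(n_categories):
            X_hat[:, col] = np.clip(np.rint(X_hat[:, col]), 0, k - 1)
        completed.append(X_hat)
    return completed


# Synthetic data (NOT from the study), with missingness rates echoing the abstract.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.integers(0, 4, n),  # hypothetical race code (4 categories)
    rng.integers(0, 2, n),  # hypothetical ethnicity code (2 categories)
    rng.integers(0, 2, n),  # flu vaccination flag
]).astype(float)
X[rng.random(n) < 0.15, 0] = np.nan  # about 15% of race values missing
X[rng.random(n) < 0.34, 1] = np.nan  # about 34% of ethnicity values missing

datasets = impute_m_datasets(X, n_categories=[4, 2, 2], m=15)
print(len(datasets), datasets[0].shape)
```

Treating category codes as numeric is a simplification; chained-equation tools that model categorical variables directly avoid the rounding step.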

RESULTS

After imputation, an additional 780,339 observations were available compared with complete case analysis. All imputation methods exhibited significant spatial clustering for race imputation (G-statistics: MICE=26,452.7, Iterative-Imputer=128,280.3, miceforest=26,891.5; P<.001), while ethnicity imputation showed variable clustering patterns (G-statistics: MICE=1142.2, Iterative-Imputer=1.7, miceforest=2185.0; P: MICE<.001, Iterative-Imputer=1.7, miceforest<.001). MICE and miceforest best preserved the proportional distribution of demographics. Computational efficiency varied: for 15 imputations, MICE required 14 hours, Iterative-Imputer 2 minutes, and miceforest 10 minutes. Postimputation estimates indicated a 0.87%-18% reduction in stratified flu vaccination coverage rates. Overall estimated flu vaccination rates decreased from 26% to 19% after imputation.
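One simple way to check whether an imputation preserves a variable's proportional distribution is to compare category shares among the originally observed values with those in the completed data. The sketch below is an assumption about how such a check could look; the abstract does not state the exact metric the authors used, and the helper names and toy values are hypothetical.

```python
import numpy as np


def category_proportions(codes, n_categories):
    """Share of each category among the non-missing integer codes."""
    codes = np.asarray(codes, dtype=float)
    observed = codes[~np.isnan(codes)].astype(int)
    counts = np.bincount(observed, minlength=n_categories)
    return counts / counts.sum()


def max_proportion_shift(before, after, n_categories):
    """Largest absolute change in any category's share after imputation."""
    p_before = category_proportions(before, n_categories)
    p_after = category_proportions(after, n_categories)
    return float(np.max(np.abs(p_after - p_before)))


# Toy example (NOT study data): NaN marks a missing value before imputation.
before = np.array([0, 0, 1, np.nan, 2, np.nan, 0, 1])
after = np.array([0, 0, 1, 0, 2, 1, 0, 1], dtype=float)
shift = max_proportion_shift(before, after, n_categories=3)
print(f"largest shift in any category share: {shift:.4f}")
```

A small shift suggests the imputed values follow the observed distribution; whether that is desirable depends on whether the data are plausibly missing at random.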

CONCLUSIONS

Both MICE and miceforest offer flexible and reliable approaches for imputing missing demographic data while mitigating bias compared with Iterative-Imputer. Our results also highlight that the choice of imputation method can profoundly affect research findings. Although MICE and miceforest had better effect sizes and reliability, MICE was far more expensive in computation and run time, limiting its use in large surveillance datasets. Miceforest can use cloud-based computing, which further improves efficiency by offloading resource-intensive tasks, enabling parallel execution, and minimizing processing delays. The significant decrease in vaccination coverage estimates illustrates how incomplete or missing data can obscure real disparities. Our findings support the regular application of imputation methods in immunization surveillance to improve health equity evaluations and shape targeted public health interventions and programming.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0103/12380239/9dc06e685648/publichealth-v11-e73916-g001.jpg
