
Handling Missing Data in COVID-19 Incidence Estimation: Secondary Data Analysis.

Author information

School of Preventive Medicine and Public Health, Hanoi Medical University, 1 Ton That Tung Street, Kim Lien Ward, Dong Da District, Hanoi, 100000, Vietnam, 84 368-577-4236.

UMass Chan Medical School, University of Massachusetts Medical School, Worcester, MA, United States.

Publication information

JMIR Public Health Surveill. 2024 Aug 20;10:e53719. doi: 10.2196/53719.

Abstract

BACKGROUND

The COVID-19 pandemic has revealed significant challenges in disease forecasting and in developing a public health response, emphasizing the need to manage missing data from various sources in making accurate forecasts.

OBJECTIVE

We aimed to show how handling missing data can affect estimates of the COVID-19 incidence rate (CIR) in different pandemic situations.

METHODS

This study used data from the COVID-19/SARS-CoV-2 surveillance system at the National Institute of Hygiene and Epidemiology, Vietnam. We separated the available data set into 3 distinct periods: zero COVID-19, transition, and new normal. In the variable daily caseload of COVID-19, we randomly removed 5% to 30% of the data, in increments of 5%, so that the removed values were missing completely at random. We selected 7 analytical methods to assess the effects of handling missing data and calculated statistical and epidemiological indices to measure the effectiveness of each method.
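The deletion step described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `remove_mcar`, the fixed seed, and the synthetic `daily_cases` series are assumptions made for illustration, and the study's own data and software are not shown in the abstract.

```python
import random

def remove_mcar(series, percent, seed=0):
    """Return a copy of `series` with `percent`% of entries set to None.

    Positions are drawn uniformly at random, independent of the values,
    which is what "missing completely at random" (MCAR) means.
    """
    rng = random.Random(seed)
    n_missing = len(series) * percent // 100
    missing_idx = set(rng.sample(range(len(series)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(series)]

# Hypothetical daily case counts (100 days); NOT data from the study.
daily_cases = [(i * 7) % 13 for i in range(100)]

# One masked copy per missingness level: 5%, 10%, ..., 30%.
scenarios = {pct: remove_mcar(daily_cases, pct) for pct in range(5, 35, 5)}

for pct, masked in scenarios.items():
    n_none = sum(v is None for v in masked)
    print(f"{pct}% scenario: {n_none} of {len(masked)} values removed")
```

Each masked copy can then be passed to an imputation method and compared against the complete series.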

RESULTS

Our study examined missing data imputation performance across 3 study time periods: zero COVID-19 (n=3149), transition (n=1290), and new normal (n=9288). Imputation analyses showed that K-nearest neighbor (KNN) had the lowest mean absolute percentage change (APC) in CIR across the range (5% to 30%) of missing data. For instance, with 15% missing data, KNN resulted in 10.6%, 10.6%, and 9.7% average bias across the zero COVID-19, transition, and new normal periods, compared to 39.9%, 51.9%, and 289.7% with the maximum likelihood method. The autoregressive integrated moving average model showed the greatest mean APC in the mean number of confirmed cases of COVID-19 during each COVID-19 containment cycle (CCC) when we imputed the missing data in the zero COVID-19 period, rising from 226.3% at the 5% missing level to 6955.7% at the 30% missing level. Imputing missing data with median imputation methods had the lowest bias in the average number of confirmed cases in each CCC at all levels of missing data. In detail, in the 20% missing scenario, while median imputation had an average bias of 16.3% for confirmed cases in each CCC, which was lower than the KNN figure, maximum likelihood imputation showed a bias on average of 92.4% for confirmed cases in each CCC, which was the highest figure. During the new normal period in the 25% and 30% missing data scenarios, KNN imputation had average biases for CIR and confirmed cases in each CCC ranging from 21% to 32% for both, while maximum likelihood and moving average imputation showed biases on average above 250% for both CIR and confirmed cases in each CCC.
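To make the bias figures above concrete, the sketch below applies median imputation to a short series and computes an absolute percentage change between the estimate from the complete data and the estimate after imputation. The formula |imputed - reference| / |reference| × 100 is an assumed reading of APC, since the abstract does not define it. The mean daily count stands in for the study's indices (CIR and confirmed cases per CCC), and the numbers are invented for illustration.

```python
import statistics

def median_impute(series):
    """Replace each None with the median of the observed values."""
    observed = [v for v in series if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in series]

def absolute_percentage_change(reference, estimate):
    """Assumed APC definition: |estimate - reference| / |reference| * 100."""
    return abs(estimate - reference) / abs(reference) * 100

# Hypothetical values; NOT data from the study.
true_cases = [1, 3, 5, 20, 40]
with_gaps = [1, None, 5, 20, None]

imputed = median_impute(with_gaps)  # the observed median (5) fills both gaps

apc = absolute_percentage_change(
    statistics.mean(true_cases),  # estimate from the complete series
    statistics.mean(imputed),     # estimate after imputation
)
print(f"APC of the mean daily count: {apc:.1f}%")
```

In this toy series the removed values include the largest count, so a single central value understates the mean. This illustrates why the bias of an imputation method can depend on the shape of the epidemic curve in each period.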

CONCLUSIONS

Our study emphasizes that the imputation method used by investigators should be tailored to the specific epidemiological context and data collection environment to ensure reliable estimates of the CIR.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36b1/11350390/d4853591168d/publichealth-v10-e53719-g001.jpg
