Ahammed Tanvir, Hossain Md Sakhawat, McMahan Christopher, Rennert Lior
Department of Public Health Sciences, Clemson University, Clemson, SC, USA; Center for Public Health Modeling and Response, Clemson University, Clemson, SC, USA.
Center for Public Health Modeling and Response, Clemson University, Clemson, SC, USA; School of Mathematical and Statistical Sciences, Clemson University, Clemson, SC, USA.
Epidemics. 2025 Jun;51:100823. doi: 10.1016/j.epidem.2025.100823. Epub 2025 Apr 3.
The lack of conventional methods for estimating real-time infectious disease burden at fine geographic scales inhibits timely and efficient public health response. Comprehensive data sources (e.g., state health department data) typically needed for such estimation are often limited by 1) substantial delays in data reporting and 2) a lack of geographic granularity in the data provided to researchers. Leveraging real-time local health system data presents an opportunity to overcome these challenges. This study evaluates the effectiveness of machine learning and statistical approaches that use local health system data to estimate current and previous COVID-19 hospitalizations in South Carolina. For current weekly hospitalizations, Random Forest models demonstrated consistently higher average median percent agreement than generalized linear mixed models across the 123 ZIP codes (72.29%, IQR: 63.20-75.62%) and 28 counties (76.43%, IQR: 70.33-81.16%) with sufficient health system coverage. To account for populations underrepresented in health systems, we combined Random Forest models with Classification and Regression Trees (CART) for imputation. With this combined approach, the average median percent agreement was 61.02% (IQR: 51.17-72.29%) for all ZIP codes and 72.64% (IQR: 66.13-77.69%) for all counties. Median percent agreement for cumulative hospitalizations over the previous 6 months was 80.98% (IQR: 68.99-89.66%) for all ZIP codes and 81.17% (IQR: 68.55-91.33%) for all counties. These findings emphasize the effectiveness of using real-time health system data to estimate infectious disease burden. Moreover, the methodologies developed in this study can be adapted to estimate hospitalizations for other diseases, offering public health officials a valuable tool for responding swiftly and effectively to various health crises.
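The summary statistics reported above (a per-region median of weekly percent agreement, then a mean and IQR across regions) can be illustrated with a minimal sketch. The abstract does not define percent agreement, so the formula below (the smaller of the estimated and observed counts divided by the larger) is an assumption made purely for illustration, and the function names and toy counts are hypothetical rather than taken from the study.

```python
from statistics import median, quantiles


def percent_agreement(estimated, observed):
    """Agreement score in [0, 100] between two non-negative counts.

    ASSUMPTION: min/max ratio, chosen for illustration only; the study's
    actual definition is not given in the abstract.
    """
    if estimated < 0 or observed < 0:
        raise ValueError("counts must be non-negative")
    largest = max(estimated, observed)
    if largest == 0:
        return 100.0  # both counts are zero, so they agree exactly
    return 100.0 * min(estimated, observed) / largest


def region_median_agreement(estimates, observations):
    """Median of the weekly agreement scores for a single region."""
    scores = [percent_agreement(e, o) for e, o in zip(estimates, observations)]
    return median(scores)


def summarize_across_regions(weekly_by_region):
    """Mean and interquartile range of the per-region medians.

    weekly_by_region maps a region label to (estimates, observations).
    """
    medians = [region_median_agreement(est, obs)
               for est, obs in weekly_by_region.values()]
    q1, _, q3 = quantiles(medians, n=4, method="inclusive")
    return {"mean_of_medians": sum(medians) / len(medians), "iqr": (q1, q3)}


# Toy weekly counts for three hypothetical regions (not study data).
toy = {
    "region_a": ([8, 10, 5], [10, 10, 10]),
    "region_b": ([3, 6, 9], [6, 6, 6]),
    "region_c": ([10], [10]),
}
summary = summarize_across_regions(toy)
```

Taking the median within each region first limits the influence of a few badly estimated weeks, while the IQR across regions shows how unevenly accuracy is distributed geographically.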