Li Megan Mun, Pham Anh, Kuo Tsung-Ting
Department of Biology, University of California San Diego, La Jolla, California, USA.
UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA.
JAMIA Open. 2022 Jun 25;5(3):ooac056. doi: 10.1093/jamiaopen/ooac056. eCollection 2022 Oct.
Predicting daily trends in Coronavirus Disease 2019 (COVID-19) case numbers is important for supporting individual decisions about taking preventive measures. This study aims to use COVID-19 case number history, demographic characteristics, and social distancing policies, both independently and interdependently, to predict whether county-level cases will rise or fall each day.
We extracted 2093 features for 3142 US counties: 5 from the US COVID-19 case number history, 1824 from the demographic characteristics (independently and interdependently), and 264 from the social distancing policies (independently and interdependently). Using the 200 top-ranked features, we built 4 machine learning models (Logistic Regression, Naïve Bayes, Multi-Layer Perceptron, and Random Forest) and 4 Ensemble methods (Average, Product, Minimum, and Maximum), and compared their performance.
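The four Ensemble methods can be illustrated with a minimal sketch. It assumes each base model outputs one probability of a rise per county-day and that each ensemble rule reduces those per-model probabilities to a single score; the abstract does not spell out the exact procedure. The names `RULES` and `combine` and the scores below are hypothetical and are not taken from the study's code.

```python
from math import prod
from statistics import fmean

# Hypothetical mapping: each rule reduces one sample's per-model
# probabilities to a single ensemble score.
RULES = {
    "average": fmean,
    "product": prod,
    "minimum": min,
    "maximum": max,
}


def combine(probabilities, rule):
    """Combine per-model probabilities sample by sample.

    probabilities: one list per model, each holding one P(rise) per sample.
    rule: one of the keys in RULES.
    """
    reducer = RULES[rule]
    # zip(*...) regroups the scores so each tuple holds every model's
    # output for the same sample.
    return [reducer(scores) for scores in zip(*probabilities)]


# Made-up scores from four models for three county-days (illustrative only).
model_scores = [
    [0.8, 0.4, 0.5],
    [0.6, 0.2, 0.5],
    [0.7, 0.3, 0.5],
    [0.9, 0.1, 0.5],
]

for name in RULES:
    print(name, combine(model_scores, name))
```

Under this reading, the Product and Minimum rules pull the score down whenever any single model is doubtful, while the Maximum rule follows the most confident model and the Average rule smooths over disagreement.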
The Ensemble Average method had the highest area under the receiver operating characteristic curve (AUC), at 0.692. The top-ranked features were all interdependent features.
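For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen positive sample (here, a rise) receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy data are assumptions for illustration, not the study's evaluation code.

```python
def auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a positive sample (rise), 0 for a negative one (fall).
    scores: predicted scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
print(auc(toy_labels, toy_scores))
```

On this scale, 0.5 corresponds to random ranking and 1.0 to perfect ranking.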
The findings of this study suggest that diverse features, especially when combined, have predictive power for county-level trends in COVID-19 cases and can help individuals make their daily decisions. Our results may guide future studies to consider more features interdependently, drawn from conventionally distinct data sources, in county-level predictive models. Our code is available at: https://doi.org/10.5281/zenodo.6332944.