Liu Yang, Whitfield Christopher, Zhang Tianyang, Hauser Amanda, Reynolds Taeyonn, Anwar Mohd
Human-Centered AI (HC-AI) Lab, North Carolina A&T State University, Greensboro, NC 27411 USA.
University of Massachusetts Amherst, Amherst, MA 01003 USA.
Health Inf Sci Syst. 2021 Jun 25;9(1):25. doi: 10.1007/s13755-021-00158-4. eCollection 2021 Dec.
It has been over a year since the first known case of coronavirus disease (COVID-19) emerged, yet the pandemic is far from over. To date, the coronavirus pandemic has infected over eighty million people and has killed more than 1.78 million worldwide. This study aims to explore "" and "". The purpose of this study was to compare people's thoughts, behavior changes, and discussion topics with the number of confirmed cases and deaths by applying natural language processing (NLP) to COVID-19-related data.
In this study, we collected COVID-19-related data from 18 North Carolina subreddits from March to August 2020. We then analyzed the collected Reddit posts with natural language processing and machine learning methods: feature engineering, topic modeling, custom named-entity recognition (NER), and BERT-based (Bidirectional Encoder Representations from Transformers) sentence clustering. Using these methods, we were able to glean people's responses to, and concerns about, the COVID-19 pandemic in North Carolina.
We observed a positive change in attitudes towards masks for residents in North Carolina. The high-frequency words in all subreddit corpora for each of the COVID-19 mitigation strategy categories are: Distancing (DIST)-"", "", and ""; Disinfection (DIT)-"", "", and ""; Personal Protective Equipment (PPE)-"", "", and ""; Symptoms (SYM)-"", "", and ""; Testing (TEST)-"", "( "".
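As a rough illustration of how high-frequency words per mitigation category might be tallied, the sketch below counts lexicon hits in a handful of posts. It is a simplified stand-in for the study's custom NER, not the authors' pipeline. The category codes come from the abstract. The keyword sets, the sample posts, and the names `LEXICON`, `tokenize`, and `top_terms_by_category` are hypothetical placeholders, not the study's reported terms or data.

```python
import re
from collections import Counter

# Hypothetical keyword lexicon: placeholder terms chosen for illustration only.
# They are NOT the high-frequency words reported by the study.
LEXICON = {
    "DIST": {"distancing", "quarantine", "lockdown"},
    "PPE": {"mask", "masks", "gloves"},
    "TEST": {"test", "testing", "tested"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms_by_category(posts, lexicon, k=3):
    """Return the k most frequent lexicon terms found in posts, per category."""
    counts = {category: Counter() for category in lexicon}
    for post in posts:
        for token in tokenize(post):
            for category, terms in lexicon.items():
                if token in terms:
                    counts[category][token] += 1
    return {category: c.most_common(k) for category, c in counts.items()}


# Made-up example posts (assumption: real input would be subreddit submissions).
posts = [
    "Wear a mask when you go out, masks are required now.",
    "Where can I get tested? Testing sites are full.",
    "Is the mask rule still in place during lockdown?",
]
result = top_terms_by_category(posts, LEXICON)
for category, terms in result.items():
    print(category, terms)
```

A trained NER model would also catch spelling variants and multi-word entities that a fixed lexicon misses. The lexicon approach is shown only because it is easy to inspect.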
The findings of our study show that Reddit data can be used effectively to monitor the COVID-19 pandemic in North Carolina (NC). The study demonstrates the utility of NLP methods (e.g., cosine similarity, Latent Dirichlet Allocation (LDA) topic modeling, custom NER, and BERT-based sentence clustering) in discovering changes in the public's concerns and behaviors over the course of the COVID-19 pandemic in NC using Reddit data. Moreover, the results show that social media data can be utilized to surveil the epidemic situation in a specific community.
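Cosine similarity, one of the methods named above, can be sketched in a few lines. The example compares raw term-count vectors for two short texts. The abstract does not say which features or weighting the study used, so the term-count representation, the sample sentences, and the names `term_counts` and `cosine_similarity` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Build a bag-of-words term-count vector from lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


# Made-up texts standing in for posts from two different months.
early = term_counts("masks are not needed")
later = term_counts("masks are required everywhere")
similarity = cosine_similarity(early, later)
print(round(similarity, 3))
```

Values near 1 indicate that two periods share most of their vocabulary, and values near 0 indicate that the discussion has shifted. In practice, TF-IDF weighting or sentence embeddings are common alternatives to raw counts.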