Department of Statistics, George Washington University, 2121 I St NW, Washington, DC, 20052, USA.
Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA), 10903 New Hampshire Avenue, Silver Spring, MD, 20993, USA.
Sci Rep. 2023 Aug 22;13(1):13721. doi: 10.1038/s41598-023-39986-7.
We used social media data from the "covid19positive" subreddit, posted from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases. We also developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020-05/2020) and was then applied to the Delta (07/2021-09/2021) and Omicron (12/2021-03/2022) periods for case identification. We identified 310, 8,794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, but the percentage of authors reporting each fell to half or less of its early-period value. During the Omicron period, loss of sense of smell was reported less often, while sore throat was reported more often. Our study demonstrates that NLP can be used to identify COVID-19 cases accurately and to extract symptoms efficiently.
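The abstract does not say how chunking was done, so the following is only a minimal sketch of one common approach, not the authors' implementation. A long post is split into overlapping token windows that fit a BERT-style input limit, each window is scored, and the scores are averaged into a post-level decision. The names `chunk_tokens`, `classify_post`, and `score_chunk`, the window size of 510 tokens (512 minus two special tokens), the stride, the mean aggregation, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def chunk_tokens(tokens: Sequence[str], max_len: int = 510,
                 stride: int = 255) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Hypothetical helper: the window size and stride are illustrative defaults,
    not values taken from the paper.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


def classify_post(tokens: Sequence[str],
                  score_chunk: Callable[[List[str]], float],
                  max_len: int = 510, stride: int = 255,
                  threshold: float = 0.5) -> bool:
    """Label a post positive if the mean chunk-level probability meets the threshold.

    score_chunk stands in for a fine-tuned classifier that returns the
    probability that a chunk was written by a COVID-positive author.
    Mean aggregation is an assumption; max or majority vote are alternatives.
    """
    chunks = chunk_tokens(tokens, max_len, stride)
    probs = [score_chunk(chunk) for chunk in chunks]
    return sum(probs) / len(probs) >= threshold
```

In practice `score_chunk` would wrap a fine-tuned model; here any callable that maps a list of tokens to a probability can be passed in, which keeps the chunking and aggregation logic testable on its own.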