Identifying Potential Lyme Disease Cases Using Self-Reported Worldwide Tweets: Deep Learning Modeling Approach Enhanced With Sentimental Words Through Emojis.

Author information

Département de médecine sociale et préventive, École de Santé Publique de l'Université de Montréal, Université de Montréal, Montréal, QC, Canada.

Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt.

Publication information

J Med Internet Res. 2023 Oct 16;25:e47014. doi: 10.2196/47014.

Abstract

BACKGROUND

Lyme disease is among the most reported tick-borne diseases worldwide, making it a major ongoing public health concern. An effective Lyme disease case reporting system depends on timely diagnosis and reporting by health care professionals, and accurate laboratory testing and interpretation for clinical diagnosis validation. A lack of these can lead to delayed diagnosis and treatment, which can exacerbate the severity of Lyme disease symptoms. Therefore, there is a need to improve the monitoring of Lyme disease by using other data sources, such as web-based data.

OBJECTIVE

We analyzed global Twitter data to understand its potential and limitations as a tool for Lyme disease surveillance. We propose a transformer-based classification system to identify potential Lyme disease cases using self-reported tweets.

METHODS

Our initial sample included 20,000 tweets collected worldwide from a database of over 1.3 million Lyme disease tweets. After the tweets were preprocessed and geolocated, a subset of the initial sample was manually labeled as potential Lyme disease cases or non-Lyme disease cases using carefully selected keywords. Emojis were converted to sentiment words, which then replaced the emojis in the tweets. This labeled tweet set was used for the training, validation, and performance testing of DistilBERT (distilled version of BERT [Bidirectional Encoder Representations from Transformers]), ALBERT (A Lite BERT), and BERTweet (BERT for English Tweets) classifiers.
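
The emoji-to-sentiment-word step described above can be illustrated with a minimal sketch. The abstract does not give the mapping or the tooling the authors used, so the `EMOJI_SENTIMENT` table and the `replace_emojis` helper below are hypothetical and only show the general idea of substituting each emoji with a word before the text reaches a classifier.

```python
# Hypothetical emoji-to-sentiment lookup (an assumption for illustration;
# the study's actual mapping is not given in the abstract).
EMOJI_SENTIMENT = {
    "😢": "sad",
    "🤗": "caring",
    "💪": "strong",
}


def replace_emojis(tweet: str, mapping: dict[str, str] = EMOJI_SENTIMENT) -> str:
    """Replace every known emoji in the tweet with its sentiment word."""
    # Longest keys first, so a multi-code-point emoji is matched before
    # any shorter emoji that happens to be its prefix.
    for emoji in sorted(mapping, key=len, reverse=True):
        tweet = tweet.replace(emoji, f" {mapping[emoji]} ")
    # Collapse the extra whitespace introduced by the padding above.
    return " ".join(tweet.split())


if __name__ == "__main__":
    print(replace_emojis("Finally diagnosed with Lyme 😢💪"))
```

Emojis missing from the table are left untouched in this sketch; a real pipeline would need a policy for them (drop, keep, or map to a generic token).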

RESULTS

The empirical results showed that BERTweet was the best classifier among all evaluated models (average F1-score of 89.3%, classification accuracy of 90.0%, and precision of 97.1%). However, for recall, term frequency-inverse document frequency and k-nearest neighbors performed better (93.2% and 82.6%, respectively). When emojis were used to enrich the tweet embeddings, BERTweet had an increased recall (8% increase), DistilBERT had an increased F1-score of 93.8% (4% increase) and classification accuracy of 94.1% (4% increase), and ALBERT had an increased F1-score of 93.1% (5% increase) and classification accuracy of 93.9% (5% increase). The general awareness of Lyme disease was high in the United States, the United Kingdom, Australia, and Canada, with self-reported potential cases of Lyme disease from these countries accounting for around 50% (9939/20,000) of the collected English-language tweets, whereas Lyme disease-related tweets were rare in countries from Africa and Asia. The most reported Lyme disease-related symptoms in the data were rash, fatigue, fever, and arthritis, while symptoms, such as lymphadenopathy, palpitations, swollen lymph nodes, neck stiffness, and arrhythmia, were uncommon, in accordance with Lyme disease symptom frequency.
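
For readers less familiar with the metrics reported above, the following sketch shows how precision, recall, and F1-score are derived from classification counts. The function name and the example counts are assumptions chosen for illustration; they are not data from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that the classifier found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts, not taken from the paper.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because recall depends on false negatives, reducing missed cases raises recall directly. This is consistent with the reported recall gain for BERTweet when emojis are used.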

CONCLUSIONS

The study highlights the robustness of BERTweet and DistilBERT as classifiers for potential cases of Lyme disease from self-reported data. The results demonstrated that emojis are effective for enrichment, thereby improving the accuracy of tweet embeddings and the performance of classifiers. Specifically, emojis reflecting sadness, empathy, and encouragement can reduce false negatives.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2191/10616745/bee266c985e7/jmir_v25i1e47014_fig1.jpg
