Identifying X (Formerly Twitter) Posts Relevant to Dementia and COVID-19: Machine Learning Approach.

Author information

Azizi Mehrnoosh, Jamali Ali Akbar, Spiteri Raymond J

Affiliation

Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada.

Publication information

JMIR Form Res. 2024 Jun 4;8:e49562. doi: 10.2196/49562.

DOI:10.2196/49562

PMID:38833288

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11185906/

Abstract

BACKGROUND

During the pandemic, patients with dementia were identified as a vulnerable population. X (formerly Twitter) became an important source of information for people seeking updates on COVID-19, and, therefore, identifying posts (formerly tweets) relevant to dementia can be an important support for patients with dementia and their caregivers. However, mining and coding relevant posts can be daunting due to the sheer volume and high percentage of irrelevant posts.

OBJECTIVE

The objective of this study was to automate the identification of posts relevant to dementia and COVID-19 using natural language processing and machine learning (ML) algorithms.

METHODS

We used a combination of natural language processing and ML algorithms with manually annotated posts to identify posts relevant to dementia and COVID-19. We used 3 data sets containing more than 100,000 posts and assessed the capability of various algorithms in correctly identifying relevant posts.
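
The abstract does not name the traditional ML algorithms or the preprocessing used, so the following is only an illustrative sketch of how manually annotated posts can train a relevance classifier. It assumes a multinomial naive Bayes model with add-one smoothing over lowercase word tokens; the function names (`tokenize`, `train_naive_bayes`, `predict`) and the six toy posts are hypothetical and are not taken from the study's data sets.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_naive_bayes(posts, labels):
    """Count tokens per class from manually annotated (post, label) pairs."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for post, label in zip(posts, labels):
        word_counts[label].update(tokenize(post))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, post):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total_posts = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_posts in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_posts / total_posts)  # class prior
        for token in tokenize(post):
            if token in model["vocab"]:  # tokens unseen in training are skipped
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated posts: 1 = relevant to dementia and COVID-19, 0 = not relevant.
toy_posts = [
    "My mother has dementia and cannot see visitors during covid lockdown",
    "Care homes need covid vaccines for residents with alzheimer's and dementia",
    "Caregivers of people with dementia struggle with isolation in the pandemic",
    "Covid case numbers rose again in the city today",
    "The football match was postponed because of covid",
    "New phone released today with a better camera",
]
toy_labels = [1, 1, 1, 0, 0, 0]

model = train_naive_bayes(toy_posts, toy_labels)
print(predict(model, "dementia patients in care homes are isolated during covid"))
print(predict(model, "football match today"))
```

A baseline of this kind treats each post as a bag of words, which is one plausible reason a pretrained model such as ALBERT, which uses context, could do better on short posts.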

RESULTS

Our results showed that (pretrained) transfer learning algorithms outperformed traditional ML algorithms in identifying posts relevant to dementia and COVID-19. Among the algorithms tested, the transfer learning algorithm A Lite Bidirectional Encoder Representations from Transformers (ALBERT) achieved an accuracy of 82.92% and an area under the curve of 83.53%. ALBERT substantially outperformed the other algorithms tested, further emphasizing the superior performance of transfer learning algorithms in the classification of posts.
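
For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and area under the curve can be computed from a classifier's outputs. It assumes that "area under the curve" refers to the receiver operating characteristic curve, computed here through its pairwise-ranking interpretation; the labels, scores, and 0.5 threshold are made-up values for illustration, not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label matches the annotated label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random relevant post is scored above a random
    irrelevant one; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one post from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical annotated labels (1 = relevant) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"AUC      = {roc_auc(y_true, scores):.4f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.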

CONCLUSIONS

Transfer learning algorithms such as ALBERT are highly effective in identifying topic-specific posts, even when trained with limited or adjacent data, highlighting their superiority over other ML algorithms and applicability to other studies involving analysis of social media posts. Such an automated approach reduces the workload of manual coding of posts and facilitates their analysis for researchers and policy makers to support patients with dementia and their caregivers and other vulnerable populations.
