Gao Wang, Li Lin, Tao Xiaohui, Zhou Jing, Tao Jun
School of Artificial Intelligence, Jianghan University, 430056 Wuhan, China.
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, 430070 Wuhan, China.
World Wide Web. 2023;26(1):55-70. doi: 10.1007/s11280-022-01034-1. Epub 2022 Mar 16.
Every epidemic affects the lives of many people around the world and can have severe consequences. Recently, many tweets about the COVID-19 pandemic have been shared publicly on social media platforms. Analyzing these tweets can help emergency response organizations prioritize their tasks and make better decisions. However, most of these tweets are non-informative, which makes it challenging to build an automated system that detects useful information on social media. Furthermore, existing methods ignore unlabeled data and topic background knowledge, both of which can provide additional semantic information. In this paper, we propose a novel Topic-Aware BERT (TABERT) model to address these challenges. First, TABERT uses a topic model to extract the latent topics of tweets. Second, a flexible framework combines the topic information with the output of BERT. Finally, we adopt adversarial training to achieve semi-supervised learning, so that a large amount of unlabeled data can be used to improve the model's internal representations. Experimental results on a dataset of COVID-19 English tweets show that our model outperforms classic and state-of-the-art baselines.
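The abstract states that topic information is combined with the output of BERT, but it does not say how. As a rough illustration only, the sketch below assumes the simplest possible fusion: concatenating a sentence vector with a topic distribution and passing the result through one linear layer and a softmax. The function names, vector sizes, random weights, and the concatenation itself are hypothetical stand-ins, not the authors' architecture, and no real BERT or topic model is run here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(text_vec, topic_vec, weights, bias):
    """Concatenate the two vectors, then apply one linear layer and softmax.

    Concatenation is an assumed fusion strategy chosen for illustration;
    the abstract only says that a "flexible framework" is used.
    """
    fused = list(text_vec) + list(topic_vec)
    for row in weights:
        if len(row) != len(fused):
            raise ValueError("each weight row must match the fused vector length")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical toy inputs (sizes chosen only to keep the example small).
rng = random.Random(0)
text_vec = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for a BERT sentence vector
topic_vec = [0.7, 0.2, 0.1]                        # stand-in for a topic distribution
n_classes = 2                                      # informative vs. non-informative
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(len(text_vec) + len(topic_vec))]
    for _ in range(n_classes)
]
bias = [0.0] * n_classes

probs = fuse_and_classify(text_vec, topic_vec, weights, bias)
print(probs)  # two class probabilities that sum to 1
```

In a real system, the stand-in vectors would come from a pretrained encoder and a fitted topic model, and the weights would be learned; the adversarial, semi-supervised training step described in the abstract is not covered by this sketch.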