Kothuru Srinivasulu, Santhanavijayan A
Department of Computer Science and Engineering, National Institute of Technology, Thuvakudi, Tiruchirappalli, Tamil Nadu 620015 India.
Soc Netw Anal Min. 2023;13(1):25. doi: 10.1007/s13278-023-01025-8. Epub 2023 Jan 17.
Identifying informative COVID-19 tweets is useful for building monitoring systems that track the latest updates. Existing approaches to identifying informative tweets rely on a large number of labelled tweets to achieve good performance. Because labelling is an expensive and laborious process, there is a need for approaches that can identify informative COVID-19 tweets using limited labelled data. In this paper, we propose a simple yet novel labelled-data-efficient approach that achieves a state-of-the-art (SOTA) F1-score of 91.23 on the WNUT COVID-19 dataset using just 1000 tweets (14.3% of the full training set). Our approach starts with limited labelled data, augments it using data augmentation methods, and then fine-tunes the model on the augmented dataset. It is the first work to approach the task of identifying informative English COVID-19 tweets using limited labelled data while achieving new SOTA performance.
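The augmentation step of the pipeline described above can be sketched as follows. This is a minimal illustration only: the abstract does not name the augmentation methods used, so random word swap and random word deletion stand in here as assumptions, and the function names, example tweets, and label strings are hypothetical. The fine-tuning step is not shown.

```python
import random


def random_swap(tokens, rng):
    """Swap two randomly chosen token positions (no-op for fewer than 2 tokens)."""
    if len(tokens) < 2:
        return tokens[:]
    i, j = rng.sample(range(len(tokens)), 2)
    out = tokens[:]
    out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(dataset, copies_per_example=2, seed=0):
    """Return the original (text, label) pairs plus perturbed copies.

    Each perturbed copy keeps the label of the tweet it was derived from.
    The two operations are stand-ins, not the paper's stated methods.
    """
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        tokens = text.split()
        for _ in range(copies_per_example):
            op = rng.choice([random_swap, random_delete])
            augmented.append((" ".join(op(tokens, rng)), label))
    return augmented


# Hypothetical toy examples; not taken from the WNUT COVID-19 dataset.
labelled = [
    ("new covid cases reported in the city today", "INFORMATIVE"),
    ("stay safe everyone", "UNINFORMATIVE"),
]
augmented = augment(labelled, copies_per_example=2)
```

In this sketch, `augmented` would then be passed to whatever fine-tuning routine is used for the classifier; the small labelled set grows by a fixed factor without any additional manual labelling.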