Big Data Analytics and Web Intelligence Laboratory, Department of Computer Science & Engineering, Delhi Technological University, New Delhi, India.
Comput Biol Med. 2021 Nov;138:104920. doi: 10.1016/j.compbiomed.2021.104920. Epub 2021 Oct 12.
The recent outbreak of the novel coronavirus disease, COVID-19, has been declared a pandemic by the World Health Organization (WHO). Social media platforms have played a vital role in providing and obtaining information about ongoing events. However, consuming a vast amount of online textual data to predict an event's trends can be difficult. To our knowledge, no prior study jointly analyzes online news articles and disease data for COVID-19. We therefore propose an LDA-based topic model, called PAN-LDA (Pandemic-Latent Dirichlet Allocation), that incorporates COVID-19 case data and news articles into standard LDA to obtain a new set of features. These features are supplied as additional inputs to machine learning (ML) algorithms to improve the forecasting of time series data. We employ collapsed Gibbs sampling (CGS) as the underlying technique for parameter inference. Experimental results suggest that the features obtained from PAN-LDA yield more identifiable topics and empirically add value to the forecasting outcome.
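As a rough illustration of the inference technique named above, the following is a minimal sketch of collapsed Gibbs sampling for standard LDA, not the PAN-LDA model itself, whose details the abstract does not give. The function name `lda_collapsed_gibbs`, the hyperparameter values, the toy word-id documents, and the lagged case counts are all hypothetical and chosen only for illustration. The last step shows one plausible way topic proportions could be appended to a time-series feature matrix; how the paper actually combines them is an assumption here.

```python
import numpy as np


def lda_collapsed_gibbs(docs, vocab_size, num_topics,
                        alpha=0.1, beta=0.01, num_iters=50, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi): per-document topic proportions and
    per-topic word distributions.
    """
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))    # tokens in doc d assigned to topic k
    topic_word = np.zeros((num_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(num_topics)               # total tokens assigned to topic k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.integers(num_topics, size=len(doc))
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            z = assignments[d]
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[i]
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                k = rng.choice(num_topics, p=p / p.sum())
                # Add the token back under the newly sampled topic.
                z[i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1

    theta = (doc_topic + alpha) / (
        doc_topic.sum(axis=1, keepdims=True) + num_topics * alpha)
    phi = (topic_word + beta) / (topic_total[:, None] + vocab_size * beta)
    return theta, phi


# Hypothetical toy corpus: four "daily news" documents over a 6-word vocabulary.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
theta, phi = lda_collapsed_gibbs(docs, vocab_size=6, num_topics=2)

# Hypothetical lagged case counts, one per day; the topic proportions are
# appended as extra columns for a downstream forecasting model.
lagged_cases = np.array([10.0, 14.0, 21.0, 30.0])
features = np.column_stack([lagged_cases, theta])
```

Each row of `features` holds the lagged case count followed by that day's topic proportions, so any regressor that accepts a numeric matrix could consume it.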