Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data.

Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.

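The abstract does not spell out how the pseudo-documents are simulated, so the following is only a minimal sketch of one possible reading: draw short documents from known topic-word distributions, then score any model's cluster assignments against the known topic of each document. The toy topics, the one-topic-per-document assumption, the function names, and the purity score are all illustrative assumptions, not the authors' procedure.

```python
import random
from collections import Counter


def simulate_pseudo_documents(topic_word_probs, n_docs, doc_len, seed=0):
    """Draw short pseudo-documents, each generated from exactly one known topic.

    Assumption: one topic per document, as in a mixture model for short text.
    """
    rng = random.Random(seed)
    topics = list(topic_word_probs)
    docs, labels = [], []
    for _ in range(n_docs):
        topic = rng.choice(topics)
        words = list(topic_word_probs[topic])
        weights = [topic_word_probs[topic][w] for w in words]
        # Sample doc_len words with replacement according to the topic's weights.
        docs.append(rng.choices(words, weights=weights, k=doc_len))
        labels.append(topic)
    return docs, labels


def purity(true_labels, predicted_labels):
    """Share of documents that belong to the majority true topic of their cluster."""
    clusters = {}
    for true, pred in zip(true_labels, predicted_labels):
        clusters.setdefault(pred, Counter())[true] += 1
    majority = sum(c.most_common(1)[0][1] for c in clusters.values())
    return majority / len(true_labels)


# Hypothetical toy topics, invented for illustration only.
TOPICS = {
    "health": {"virus": 0.4, "vaccine": 0.3, "mask": 0.3},
    "economy": {"jobs": 0.4, "market": 0.3, "lockdown": 0.3},
}

docs, labels = simulate_pseudo_documents(TOPICS, n_docs=200, doc_len=8)
```

Under these assumptions, the cluster labels produced by any fitted model (LDA, GSDMM or GPM) on `docs` could be passed to `purity(labels, predicted)` to compare how well each model recovers the known topics, without relying on coherence scores.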
