Christoph Weisser, Christoph Gerloff, Anton Thielmann, Andre Python, Arik Reuter, Thomas Kneib, Benjamin Säfken
Georg-August-Universität Göttingen, Göttingen, Germany.
Campus-Institut Data Science (CIDAS), Göttingen, Germany.
Comput Stat. 2023;38(2):647-674. doi: 10.1007/s00180-022-01246-z. Epub 2022 Jul 9.
Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.