Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data.

Authors

Weisser Christoph, Gerloff Christoph, Thielmann Anton, Python Andre, Reuter Arik, Kneib Thomas, Säfken Benjamin

Affiliations

Georg-August-Universität Göttingen, Göttingen, Germany.

Campus-Institut Data Science (CIDAS), Göttingen, Germany.

Publication information

Comput Stat. 2023;38(2):647-674. doi: 10.1007/s00180-022-01246-z. Epub 2022 Jul 9.


DOI:10.1007/s00180-022-01246-z
PMID:37223721
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10060035/
Abstract

Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
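The abstract does not spell out how the pseudo-documents are simulated, so the following is only a minimal sketch of the general idea, under stated assumptions: each short document is generated from exactly one known topic (a mixture-of-unigrams setup, which is the assumption behind GSDMM), and a fitted model's cluster assignments are then scored against those known labels. The function names, the symmetric Dirichlet prior `beta`, the toy vocabulary, and the purity score are all illustrative choices, not the authors' procedure.

```python
import random
from collections import Counter


def sample_dirichlet(alpha, size, rng):
    """Draw one vector from a symmetric Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_pseudo_documents(n_docs, n_topics, vocab_size, doc_length,
                              beta=0.1, seed=0):
    """Simulate short documents, each drawn from exactly one known topic.

    Hypothetical helper: the generative setup here is an assumption,
    not the simulation design used in the paper.
    """
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(vocab_size)]
    # One word distribution per topic; a small beta yields sparse topics.
    topic_word = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs, labels = [], []
    for _ in range(n_docs):
        k = rng.randrange(n_topics)  # the document's single true topic
        words = rng.choices(vocab, weights=topic_word[k], k=doc_length)
        docs.append(words)
        labels.append(k)
    return docs, labels


def purity(true_labels, predicted_clusters):
    """Share of documents that belong to the majority true topic of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, predicted_clusters):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)


# Tweet-like documents: few tokens each, so word co-occurrence is sparse.
docs, labels = simulate_pseudo_documents(
    n_docs=200, n_topics=5, vocab_size=50, doc_length=8
)
```

Because the true topic of every simulated document is known, any of the three models could be fitted to `docs` and its dominant-topic assignments passed to `purity` together with `labels`. This gives a comparison that does not rely on coherence scores.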


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/e970a50d8564/180_2022_1246_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/d0fda02c93f6/180_2022_1246_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/600d226df487/180_2022_1246_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/7f0fe60c17f7/180_2022_1246_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/18af0ff358fb/180_2022_1246_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/28f209394130/180_2022_1246_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/2364827c6c72/180_2022_1246_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/5cb6af63c0fd/180_2022_1246_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9911/10060035/e615d5e461da/180_2022_1246_Fig9_HTML.jpg
