Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis.

Author information

Albalawi Rania, Yeap Tet Hin, Benyoucef Morad

Affiliations

School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada.

Telfer School of Management, University of Ottawa, Ottawa, ON, Canada.

Publication information

Front Artif Intell. 2020 Jul 14;3:42. doi: 10.3389/frai.2020.00042. eCollection 2020.

Abstract

With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information or more on the topic being discussed from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques that have gained popularity in recent years. This paper investigates the topic modeling subject and its common application areas, methods, and tools. Also, we examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their benefits practically in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of included topic modeling methods based on the topic quality and some standard statistical evaluation metrics, like recall, precision, F-score, and topic coherence. As a result, latent Dirichlet allocation and non-negative matrix factorization methods delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
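The abstract names recall, precision, and F-score among its evaluation metrics but does not say how they were computed. As a minimal sketch of the standard definitions only, the following Python snippet scores a set of extracted topic terms against a reference set. The function name `precision_recall_f1`, the example word lists, and the choice of the balanced F-score (F1) are assumptions made for illustration; they are not taken from the paper.

```python
def precision_recall_f1(predicted, relevant):
    """Return (precision, recall, F1) for two collections of items.

    Standard set-based definitions; not the paper's own procedure.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)

    # Precision: share of predicted items that are relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of relevant items that were predicted.
    recall = true_positives / len(relevant) if relevant else 0.0

    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: terms a topic model extracted vs. a hand-made reference.
extracted = ["battery", "screen", "price", "shipping"]
reference = ["battery", "screen", "camera"]

p, r, f = precision_recall_f1(extracted, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the four extracted terms appear in the reference set, so precision is 2/4 and recall is 2/3. Topic coherence, the fourth metric in the abstract, is a different kind of measure and is not covered by this sketch.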

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c977/7861298/095797722808/frai-03-00042-g0001.jpg
