Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis.

Authors

Albalawi Rania, Yeap Tet Hin, Benyoucef Morad

Affiliations

School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada.

Telfer School of Management, University of Ottawa, Ottawa, ON, Canada.

Publication information

Front Artif Intell. 2020 Jul 14;3:42. doi: 10.3389/frai.2020.00042. eCollection 2020.

DOI:10.3389/frai.2020.00042

PMID:33733159

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7861298/

Abstract

With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information in such content or to learn more about the topic being discussed. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques, which have gained popularity in recent years. This paper surveys topic modeling and its common application areas, methods, and tools. We also examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their practical benefits in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of the included topic modeling methods based on topic quality and some standard statistical evaluation metrics, such as recall, precision, F-score, and topic coherence. Latent Dirichlet allocation and non-negative matrix factorization delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
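The abstract names precision, recall, and F-score among the evaluation metrics but does not describe how they were computed. As a rough illustration only, the following sketch computes the three values for a set of predicted labels against a reference set, assuming the balanced F1 variant of the F-score. The function name `precision_recall_f1` and the example term lists are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(predicted, relevant):
    """Return (precision, recall, F1) for two collections of labels.

    Assumption: F-score here means the balanced F1 (harmonic mean of
    precision and recall); the abstract does not say which variant was used.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: terms a model assigned to one topic versus a
# hand-labelled reference set for that topic.
predicted_terms = ["flight", "delay", "airport", "pizza"]
reference_terms = ["flight", "delay", "airport", "luggage", "gate"]

p, r, f = precision_recall_f1(predicted_terms, reference_terms)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Topic coherence, the fourth metric mentioned, depends on word co-occurrence statistics over a corpus and is not covered by this sketch.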

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c977/7861298/095797722808/frai-03-00042-g0001.jpg

Similar articles

Evaluation of clustering and topic modeling methods over health-related tweets and emails.

Artif Intell Med. 2021 Jul;117:102096. doi: 10.1016/j.artmed.2021.102096. Epub 2021 May 7.

An integrated clustering and BERT framework for improved topic modeling.

Int J Inf Technol. 2023;15(4):2187-2195. doi: 10.1007/s41870-023-01268-w. Epub 2023 May 6.

The Voice of Chinese Health Consumers: A Text Mining Approach to Web-Based Physician Reviews.

J Med Internet Res. 2016 May 10;18(5):e108. doi: 10.2196/jmir.4430.

Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts.

Sensors (Basel). 2022 Jan 23;22(3):852. doi: 10.3390/s22030852.

Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis.

Artif Intell Rev. 2023;56(6):5133-5260. doi: 10.1007/s10462-022-10254-w. Epub 2022 Oct 26.

Automating Large-scale Health Care Service Feedback Analysis: Sentiment Analysis and Topic Modeling Study.

JMIR Med Inform. 2022 Apr 11;10(4):e29385. doi: 10.2196/29385.

Short text topic modelling using local and global word-context semantic correlation.

Multimed Tools Appl. 2023 Feb 2:1-23. doi: 10.1007/s11042-023-14352-x.

Web content topic modeling using LDA and HTML tags.

PeerJ Comput Sci. 2023 Jul 11;9:e1459. doi: 10.7717/peerj-cs.1459. eCollection 2023.

Modeling virtual organizations with Latent Dirichlet Allocation: a case for natural language processing.

Neural Netw. 2014 Oct;58:38-49. doi: 10.1016/j.neunet.2014.05.008. Epub 2014 Jun 2.

Cited by

Decoding HIV Discourse on Social Media: Large-Scale Analysis of 191,972 Tweets Using Machine Learning, Topic Modeling, and Temporal Analysis.

J Med Internet Res. 2025 Aug 29;27:e76745. doi: 10.2196/76745.

Public attitudes to potential synthetic cells applications: Pragmatic support and ethical acceptance.

PLoS One. 2025 Feb 27;20(2):e0319337. doi: 10.1371/journal.pone.0319337. eCollection 2025.

Development and validation of an automated machine for self-injury assessment via young Koreans' natural writings.

PLoS One. 2025 Jan 16;20(1):e0316619. doi: 10.1371/journal.pone.0316619. eCollection 2025.

Combining Topic Modeling, Sentiment Analysis, and Corpus Linguistics to Analyze Unstructured Web-Based Patient Experience Data: Case Study of Modafinil Experiences.

J Med Internet Res. 2024 Dec 11;26:e54321. doi: 10.2196/54321.

Mental illness detection through harvesting social media: a comprehensive literature review.

PeerJ Comput Sci. 2024 Oct 7;10:e2296. doi: 10.7717/peerj-cs.2296. eCollection 2024.

Enhanced analysis of large-scale news text data using the bidirectional-Kmeans-LSTM-CNN model.

PeerJ Comput Sci. 2024 Aug 1;10:e2213. doi: 10.7717/peerj-cs.2213. eCollection 2024.

Fitness or socializing - A multi-dimensional analysis of online fitness communities users.

iScience. 2024 Apr 17;27(7):109753. doi: 10.1016/j.isci.2024.109753. eCollection 2024 Jul 19.

Exploring hot topics and evolutionary paths in the Diagnosis-Related Groups (DRGs) field: a comparative study using LDA modeling.

BMC Health Serv Res. 2024 Jun 21;24(1):756. doi: 10.1186/s12913-024-11209-3.

Investigating topic modeling techniques through evaluation of topics discovered in short texts data across diverse domains.

Sci Rep. 2024 May 25;14(1):12003. doi: 10.1038/s41598-024-61738-4.

Evolution of renewable energy laws and policies in China.

Heliyon. 2024 Apr 18;10(8):e29712. doi: 10.1016/j.heliyon.2024.e29712. eCollection 2024 Apr 30.

References

An overview of topic modeling and its current applications in bioinformatics.

Springerplus. 2016 Sep 20;5(1):1608. doi: 10.1186/s40064-016-3252-8. eCollection 2016.
