Word embedding empowered topic recognition in news articles.

Authors

Kaleem Sidrah, Jalil Zakia, Nasir Muhammad, Alazab Moutaz

Affiliations

Department of Computer Science, International Islamic University, Islamabad, Pakistan.

Department of Data Science & Artificial Intelligence, International Islamic University, Islamabad, Islamabad Capital Territory, Pakistan.

Publication information

PeerJ Comput Sci. 2024 Dec 11;10:e2300. doi: 10.7717/peerj-cs.2300. eCollection 2024.

Abstract

Technology has put global news at our fingertips, anytime and anywhere, through social media and online news sources, and the resulting large collections of electronic text urgently need automated analysis. Prior studies suggest that combining topic models with word embedding models can improve text representation and benefit downstream natural language processing tasks. However, news topic recognition lacks a standardized approach to integrating the two, and existing algorithms tend to be overly complex and miss the potential benefits of fusion. To address these limitations, this research proposes a new technique, word embedding latent Dirichlet allocation, which combines topic models and word embeddings for better news topic recognition. The framework integrates probabilistic topic modeling using latent Dirichlet allocation with Gibbs sampling, semantic information from Word2Vec embeddings, and syntactic relationships to extract comprehensive text representations. Popular classifiers then use these representations to identify news topics automatically and precisely. By combining document-topic relationships with contextual information, the framework achieves improved performance, greater expressiveness, and efficient dimensionality reduction. The proposed method significantly outperforms existing approaches in news topic recognition, reaching 88% accuracy on 20NewsGroup and 97% on BBC News.
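
The abstract does not say how the topic and embedding representations are fused, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes simple concatenation of each document's topic proportions (from a small collapsed Gibbs sampler for latent Dirichlet allocation) with the mean of its word vectors. The toy corpus, the hand-made two-dimensional vectors standing in for Word2Vec embeddings, and the names `lda_gibbs`, `mean_embedding`, and `fuse` are all hypothetical. The syntactic features and the classifier stage mentioned in the abstract are not shown.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total tokens per topic
    assign = []                                                 # topic of every token

    # Random initial topic assignment for each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions (each row sums to 1).
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta


def mean_embedding(doc, embeddings):
    """Average the word vectors of a non-empty document."""
    dim = len(embeddings[0])
    return [sum(embeddings[w][j] for w in doc) / len(doc) for j in range(dim)]


def fuse(docs, embeddings, num_topics):
    """Concatenate topic proportions with the mean word vector (assumed fusion)."""
    theta = lda_gibbs(docs, num_topics, vocab_size=len(embeddings))
    return [theta[d] + mean_embedding(doc, embeddings) for d, doc in enumerate(docs)]


# Hypothetical toy vocabulary; documents are lists of word ids.
vocab = ["goal", "match", "team", "stock", "market", "bank"]
docs = [[0, 1, 2, 0], [1, 2, 2], [3, 4, 5, 3], [4, 5, 5]]
# Hand-made 2-d vectors standing in for trained Word2Vec embeddings.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]]

NUM_TOPICS = 2
features = fuse(docs, embeddings, NUM_TOPICS)
for row in features:
    print([round(x, 3) for x in row])
```

Each row of `features` holds the topic proportions followed by the averaged embedding, so its length is the number of topics plus the embedding dimension. In a full pipeline these rows would be passed to a classifier; concatenation is chosen here only because it is the simplest way to keep both views of a document in one vector.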

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8c8/11784532/834ef0f8643d/peerj-cs-10-2300-g001.jpg
