Kaleem Sidrah, Jalil Zakia, Nasir Muhammad, Alazab Moutaz
Department of Computer Science, International Islamic University, Islamabad, Pakistan.
Department of Data Science & Artificial Intelligence, International Islamic University, Islamabad, Islamabad Capital Territory, Pakistan.
PeerJ Comput Sci. 2024 Dec 11;10:e2300. doi: 10.7717/peerj-cs.2300. eCollection 2024.
Advances in technology have put global news at our fingertips, anytime and anywhere, through social media and online news sources. The resulting electronic text collections are large, and methods for analyzing them are urgently needed. Prior work suggests that combining topic models with word embedding models can improve text representation and help with downstream natural language processing tasks. However, news topic recognition lacks a standardized approach to integrating the two, and existing algorithms tend to be overly complex and miss the potential benefits of fusion. To address these limitations, this research proposes a new technique, word embedding latent Dirichlet allocation, that combines topic models and word embeddings for news topic recognition. The framework integrates probabilistic topic modeling using latent Dirichlet allocation with Gibbs sampling, semantic information from Word2Vec embeddings, and syntactic relationships to extract comprehensive text representations. Popular classifiers then use these representations to identify news topics automatically. By combining document-topic relationships with contextual information, the framework improves performance and expressiveness while reducing dimensionality. The proposed word embedding method outperforms existing approaches in news topic recognition, reaching 88% accuracy on 20NewsGroup and 97% on BBC News.
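The fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the simplest possible fusion, concatenating each document's LDA topic distribution (estimated by collapsed Gibbs sampling) with the mean of its word vectors. The names `lda_gibbs`, `mean_embedding`, and `fuse` are hypothetical, the hyperparameters are arbitrary, and the two-dimensional `toy_vectors` table is a hand-made stand-in for trained Word2Vec embeddings. The syntactic features and the classifier stage mentioned in the abstract are omitted.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens of word v in topic k
    topic_total = [0] * n_topics                            # all tokens in topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed document-topic distributions.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def mean_embedding(doc, vectors, dim):
    """Average the vectors of the words that have one; zeros if none do."""
    known = [vectors[w] for w in doc if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def fuse(docs, vectors, dim, n_topics):
    """Concatenate topic proportions with the mean word vector (assumed fusion rule)."""
    theta = lda_gibbs(docs, n_topics)
    return [t + mean_embedding(doc, vectors, dim) for t, doc in zip(theta, docs)]


docs = [
    "match goal team win".split(),
    "team goal match score".split(),
    "market bank stock profit".split(),
    "stock market bank trade".split(),
]
# Hypothetical 2-d vectors standing in for trained Word2Vec embeddings.
toy_vectors = {
    "match": [1.0, 0.0], "goal": [0.9, 0.1], "team": [0.8, 0.2],
    "market": [0.0, 1.0], "bank": [0.1, 0.9], "stock": [0.2, 0.8],
}
features = fuse(docs, toy_vectors, dim=2, n_topics=2)
```

Each row of `features` holds `n_topics` topic proportions followed by `dim` embedding components, so it could be passed to any standard classifier. In practice the embedding table would come from a trained Word2Vec model rather than a hand-written dictionary.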