Kinariwala Supriya, Deshmukh Sachin
Maharashtra Institute of Technology, Aurangabad, Maharashtra, India.
Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India.
Multimed Tools Appl. 2023 Feb 2:1-23. doi: 10.1007/s11042-023-14352-x.
Nowadays, people use short text to express their opinions on social media platforms such as Twitter, Facebook, and YouTube, and to share their purchasing experiences on e-commerce websites such as Amazon and Flipkart. Every day, billions of short texts are created worldwide as tweets, tags, keywords, search queries, etc. However, short text carries little contextual information and can be ambiguous, sparse, and noisy, which remains a major challenge. State-of-the-art topic modeling strategies such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis are not well suited to short text because a single document contains only a limited number of words. This work proposes a new model named G_SeaNMF (Gensim_SeaNMF) to improve the word-context semantic relationship by using local and global word embedding techniques. Word embeddings learned from a large corpus provide general semantic and syntactic information about words; they can guide topic modeling for short text collections by supplementing sparse co-occurrence patterns. In the proposed model, SeaNMF (Semantics-assisted Non-negative Matrix Factorization) is combined with the word2vec model of the Gensim library to strengthen the semantic relationships between words. This article also explores short text topic modeling techniques based on DMM (Dirichlet Multinomial Mixture), self-aggregation, and global word co-occurrence. These are evaluated using different measures of cluster coherence on real-world datasets such as Search Snippet, Biomedicine, Pascal Flickr, Tweet, and TagMyNews. Empirical evaluation shows that combining local and global word embeddings yields more appropriate words under each topic and improved outcomes.
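To make the idea of a word-context semantic relationship concrete, the following is a minimal sketch of one common way to build a word-context matrix from short texts, using positive pointwise mutual information (PPMI). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the weighting scheme, and the function name `ppmi_word_context`, the `shift` parameter, and the choice to treat each whole document as one context window are all hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def ppmi_word_context(docs, shift=1.0):
    """Return a sparse PPMI word-context matrix as {(word, context): weight}.

    Assumption: each short document acts as a single context window, so
    every pair of distinct words in a document counts as one co-occurrence.
    """
    pair_counts = Counter()  # ordered (word, context) co-occurrence counts
    for doc in docs:
        tokens = sorted(set(doc.lower().split()))
        for word, context in permutations(tokens, 2):
            pair_counts[(word, context)] += 1

    # Marginal count per word: total co-occurrences it takes part in.
    row_counts = Counter()
    for (word, _), count in pair_counts.items():
        row_counts[word] += count
    total = sum(pair_counts.values())

    matrix = {}
    for (word, context), count in pair_counts.items():
        pmi = math.log(count * total / (row_counts[word] * row_counts[context]))
        value = pmi - math.log(shift)  # optional shift; shift=1.0 disables it
        if value > 0:  # keep only positive entries, so the matrix stays sparse
            matrix[(word, context)] = value
    return matrix


docs = ["apple banana", "apple banana", "cat dog"]
S = ppmi_word_context(docs)
```

Pairs that never co-occur are simply absent from the result, which mirrors the sparsity problem the abstract describes. A semantics-assisted factorization would factorize a matrix of this kind alongside the term-document matrix, and pretrained embeddings (for example from Gensim's word2vec) could supplement the sparse entries.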