CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.
Ministry of Higher Education and Scientific Research, Baghdad, Iraq.
PLoS One. 2024 Aug 23;19(8):e0309206. doi: 10.1371/journal.pone.0309206. eCollection 2024.
Text clustering is an essential task in data mining and information retrieval. Its aim is to group unlabeled texts into meaningful clusters that make it easier to extract and understand useful information from large volumes of textual data. However, short text clustering (STC) is difficult because short texts are typically sparse, ambiguous, and noisy, and carry little information. One of the challenges for STC is finding a proper representation of short text documents that yields cohesive clusters. Typically, STC relies on a single-view representation, which is ineffective because it cannot capture the different aspects of the target text. In this paper, we propose a multi-view representation (MVR) to enhance STC, obtained by finding the best combination of different single-view representations. We explore different types of MVR built from different sets of single-view representations. The single-view representations are combined by fixed-length concatenation using the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of the various MVRs on STC. In the experiments, the most effective combination for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of the single-view representations.
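The fixed-length concatenation step can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so this sketch assumes that each single-view matrix is first reduced to the same number of dimensions with PCA and that the reduced views are then concatenated feature-wise. The function names `pca_reduce` and `multi_view`, the target length `k`, and the random placeholder matrices (standing in for real BERT, FastText, or GloVe embeddings) are all hypothetical and not taken from the paper.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto its top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def multi_view(views, k):
    """Reduce every view to k dimensions, then concatenate feature-wise."""
    return np.hstack([pca_reduce(V, k) for V in views])


# Placeholder data: 20 documents described by three views of different widths.
# Real single-view representations would replace these random matrices.
rng = np.random.default_rng(0)
n_docs = 20
views = [rng.normal(size=(n_docs, d)) for d in (768, 300, 50)]

mvr = multi_view(views, k=8)
print(mvr.shape)  # (20, 24): 3 views x 8 components each
```

Under this assumption every view contributes the same number of dimensions regardless of its original width, so no single representation dominates the combined vector by size alone. The resulting matrix can then be passed to any clustering algorithm.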