Abdalgader Khaled, Matroud Atheer A, Hossin Khaled
Department of Computer Science and Engineering, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates.
De Montfort University-Dubai, Dubai, United Arab Emirates.
PeerJ Comput Sci. 2024 May 29;10:e2078. doi: 10.7717/peerj-cs.2078. eCollection 2024.
Sentence clustering plays a central role in many text-processing activities and has received extensive attention, particularly with respect to measuring semantic similarity between sentences. However, relatively little work has evaluated clustering performance with similarity measures that adopt low-dimensional continuous representations. Such representations are crucial for sentence clustering, where traditional word co-occurrence representations often perform poorly on semantically similar sentences that share no common words. This article presents a new implementation that incorporates a sentence similarity measure based on embedding representations and uses it to evaluate three types of text clustering methods (partitional, hierarchical, and fuzzy clustering) on standard textual datasets. The measure derives its semantic information from pre-trained models designed to simulate human knowledge about words in natural language. The article also compares the measure's performance when it is trained on each of two state-of-the-art pre-trained models, to investigate which yields better results. We argue that the superior performance of the selected clustering methods stems from their more effective use of the semantic information offered by this embedding-based similarity measure. Furthermore, we apply hierarchical clustering, the best-performing method, to a text summarization task and report the results. The implementation demonstrates that incorporating the sentence embedding measure leads to significantly improved performance in both text clustering and text summarization.
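The idea in the abstract, that an embedding-based measure can relate sentences sharing no common words where a word-overlap measure cannot, can be illustrated with a small sketch. This is not the authors' implementation: the three-dimensional word vectors are invented for illustration (a real system would load them from a pre-trained model), mean pooling and cosine similarity are assumed as the sentence measure, and the function names and the average-linkage procedure are hypothetical choices standing in for the hierarchical clustering the article evaluates.

```python
import math

# Hypothetical toy word vectors, invented for this example only.
# A real system would take them from a pre-trained embedding model.
TOY_VECTORS = {
    "doctor": [0.90, 0.10, 0.00], "treats": [0.70, 0.20, 0.10],
    "patients": [0.80, 0.10, 0.10], "physician": [0.85, 0.15, 0.00],
    "cures": [0.75, 0.20, 0.05], "illness": [0.70, 0.10, 0.20],
    "stocks": [0.00, 0.90, 0.10], "fell": [0.10, 0.80, 0.20],
    "sharply": [0.10, 0.60, 0.30], "shares": [0.05, 0.90, 0.05],
    "dropped": [0.10, 0.75, 0.25], "quickly": [0.15, 0.60, 0.30],
}

SENTENCES = [
    "doctor treats patients",
    "physician cures illness",   # no word in common with sentence 0
    "stocks fell sharply",
    "shares dropped quickly",    # no word in common with sentence 2
]


def sentence_vector(sentence):
    """Mean of the word vectors of all known tokens (assumed pooling)."""
    vectors = [TOY_VECTORS[t] for t in sentence.lower().split() if t in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def embedding_similarity(s1, s2):
    """Sentence similarity as the cosine of the pooled sentence vectors."""
    return cosine(sentence_vector(s1), sentence_vector(s2))


def word_overlap(s1, s2):
    """Jaccard overlap of the token sets, a simple co-occurrence baseline."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)


def average_linkage_clusters(sentences, k):
    """Agglomerative clustering: merge the most similar pair until k clusters remain."""
    clusters = [[i] for i in range(len(sentences))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [embedding_similarity(sentences[i], sentences[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)  # average linkage
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    print(word_overlap(SENTENCES[0], SENTENCES[1]))
    print(embedding_similarity(SENTENCES[0], SENTENCES[1]))
    print(average_linkage_clusters(SENTENCES, 2))
```

With these toy vectors, the overlap baseline scores the first two sentences as unrelated, while the embedding measure groups them. The quadratic pair search is kept only for readability and would need replacing for datasets of realistic size.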