Department of Financial Information Security, Kookmin University, Seoul 02707, Korea.
Department of Software, College of Computer Science, Kookmin University, Seoul 02707, Korea.
Sensors (Basel). 2021 May 4;21(9):3196. doi: 10.3390/s21093196.
Text document clustering is the unsupervised grouping of textual documents into clusters based on content similarity. It can be applied to tasks such as search optimization and extracting hidden information from data generated by IoT sensors. Swarm intelligence (SI) algorithms rely on stochastic and heuristic principles: simple, unintelligent individuals follow a few simple rules and together accomplish very complex tasks. By mapping the features of a problem to the parameters of an SI algorithm, such algorithms can reach solutions in a flexible, robust, decentralized, and self-organized manner. Compared with traditional clustering algorithms, these solving mechanisms make swarm algorithms well suited to complex document clustering problems. However, each SI algorithm performs differently depending on its own strengths and weaknesses. In this paper, to find the best-performing SI algorithm for text document clustering, we performed a comparative study of the particle swarm optimization (PSO), bat, grey wolf optimization (GWO), and K-means algorithms using six data sets of various sizes, which were created from BBC Sport news and 20 Newsgroups. Based on our experimental results, we relate the features of the document clustering problem to the nature of SI algorithms and conclude that the PSO and GWO SI algorithms are better than K-means and that, among those algorithms, PSO performs best in terms of finding the optimal solution.
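To make the idea of mapping a clustering problem onto an SI algorithm concrete, the following is a minimal sketch of PSO-based clustering. It is an illustration under stated assumptions, not the implementation evaluated in the paper: each particle encodes one candidate set of k centroids, and the fitness is the mean Euclidean distance from each document vector to its nearest centroid. The function names (`fitness`, `pso_cluster`), the toy document vectors, and the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) are all hypothetical choices made for this example.

```python
import numpy as np

def fitness(centroids, docs):
    # Mean Euclidean distance from each document to its nearest centroid
    # (an assumed objective; lower is better).
    dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def pso_cluster(docs, k, n_particles=20, n_iter=100,
                w=0.72, c1=1.49, c2=1.49, seed=0):
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    n = docs.shape[0]

    # Each particle is a (k, d) array of centroids, initialised from random documents.
    pos = docs[rng.integers(0, n, size=(n_particles, k))]
    vel = np.zeros_like(pos)

    # Personal bests and the global best found so far.
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, docs) for p in pos])
    g = pbest_fit.argmin()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Standard PSO update: inertia + pull toward personal best + pull toward global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel

        fit = np.array([fitness(p, docs) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]

        g = pbest_fit.argmin()
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    # Assign every document to the nearest centroid of the best particle.
    dists = np.linalg.norm(docs[:, None, :] - gbest[None, :, :], axis=2)
    return dists.argmin(axis=1), gbest, gbest_fit

# Toy stand-ins for document term vectors: two well-separated groups (made-up data).
docs = np.array([
    [1.00, 0.05, 0.00],
    [0.95, 0.00, 0.05],
    [0.90, 0.10, 0.00],
    [0.00, 1.00, 0.05],
    [0.05, 0.95, 0.00],
    [0.10, 0.90, 0.05],
])
labels, centroids, score = pso_cluster(docs, k=2)
print(labels, round(float(score), 3))
```

In a real document clustering setting the rows of `docs` would typically be weighted term vectors, and a cosine-based distance could replace the Euclidean one; both choices are left open here because the abstract does not specify them.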