Shalileh Soroosh, Mirkin Boris
Center for Language and Brain, HSE University, Myasnitskaya Ulitsa 20, 101000 Moscow, Russia.
Department of Data Analysis and Artificial Intelligence, HSE University, Pokrovsky Boulevard, 11, 101000 Moscow, Russia.
Entropy (Basel). 2022 Apr 29;24(5):626. doi: 10.3390/e24050626.
This paper proposes a meaningful and effective extension of the celebrated K-means algorithm to detect communities in feature-rich networks, due to our assumption of non-summability mode. We least-squares approximate given matrices of inter-node links and feature values, leading to a straightforward extension of the conventional K-means clustering method as an alternating minimization strategy for the criterion. This works in a two-fold space, embracing both the network nodes and features. The metric used is a weighted sum of the squared Euclidean distances in the feature and network spaces. To tackle the so-called curse of dimensionality, we extend this to a version that uses the cosine distances between entities and centers. One more version of our method is based on the Manhattan distance metric. We conduct computational experiments to test our method and compare its performances with those by competing popular algorithms at synthetic and real-world datasets. The cosine-based version of the extended K-means typically wins at the high-dimension real-world datasets. In contrast, the Manhattan-based version wins at most synthetic datasets.
由于我们对非可加性模式的假设,本文提出了一种对著名的K均值算法有意义且有效的扩展,用于在特征丰富的网络中检测社区。我们通过最小二乘法逼近给定的节点间链接矩阵和特征值矩阵,从而将传统的K均值聚类方法直接扩展为一种针对该准则的交替最小化策略。这在一个双重空间中起作用,同时包含网络节点和特征。所使用的度量是特征空间和网络空间中平方欧几里得距离的加权和。为了解决所谓的维度诅咒问题,我们将其扩展为一个使用实体与中心之间余弦距离的版本。我们方法的另一个版本基于曼哈顿距离度量。我们进行计算实验来测试我们的方法,并将其性能与在合成数据集和真实世界数据集上竞争的流行算法的性能进行比较。扩展K均值的基于余弦的版本通常在高维真实世界数据集上获胜。相比之下,基于曼哈顿的版本在大多数合成数据集上获胜。