IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8729-8742. doi: 10.1109/TPAMI.2022.3231470. Epub 2023 Jun 5.
Since the representative capacity of graph-based clustering methods is usually limited by the graph constructed on the original features, it is attractive to find whether graph neural networks (GNNs), a strong extension of neural networks to graphs, can be applied to augment the capacity of graph-based clustering methods. The core problems mainly come from two aspects. On the one hand, the graph is unavailable in the most general clustering scenes so that how to construct graph on the non-graph data and the quality of graph is usually the most important part. On the other hand, given n samples, the graph-based clustering methods usually consume at least O(n) time to build graphs and the graph convolution requires nearly O(n) for a dense graph and O(|E|) for a sparse one with |E| edges. Accordingly, both graph-based clustering and GNNs suffer from the severe inefficiency problem. To tackle these problems, we propose a novel clustering method, AnchorGAE, with the self-supervised estimation of graph and efficient graph convolution. We first show how to convert a non-graph dataset into a graph dataset, by introducing the generative graph model and anchors. A bipartite graph is built via generating anchors and estimating the connectivity distributions of original points and anchors. We then show that the constructed bipartite graph can reduce the computational complexity of graph convolution from O(n) and O(|E|) to O(n). The succeeding steps for clustering can be easily designed as O(n) operations. Interestingly, the anchors naturally lead to siamese architecture with the help of the Markov process. Furthermore, the estimated bipartite graph is updated dynamically according to the features extracted by GNN modules, to promote the quality of the graph by exploiting the high-level information by GNNs. However, we theoretically prove that the self-supervised paradigm frequently results in a collapse that often occurs after 2-3 update iterations in experiments, especially when the model is well-trained. A specific strategy is accordingly designed to prevent the collapse. The experiments support the theoretical analysis and show the superiority of AnchorGAE.
由于基于图的聚类方法的表示能力通常受到原始特征上构建的图的限制,因此寻找图神经网络(GNN)是否可以应用于增强基于图的聚类方法的能力是很有吸引力的。核心问题主要来自两个方面。一方面,在最一般的聚类场景中,图是不可用的,因此如何在非图数据上构建图以及图的质量通常是最重要的部分。另一方面,对于 n 个样本,基于图的聚类方法通常至少需要 O(n)时间来构建图,而图卷积对于密集图需要近 O(n)时间,对于稀疏图(边数为|E|)需要 O(|E|)时间。因此,基于图的聚类和 GNN 都存在严重的效率问题。为了解决这些问题,我们提出了一种新的聚类方法 AnchorGAE,它具有图的自监督估计和高效图卷积。我们首先展示如何通过引入生成图模型和锚点将非图数据集转换为图数据集。通过生成锚点并估计原始点和锚点的连通性分布,构建了一个二分图。然后,我们展示了所构建的二分图可以将图卷积的计算复杂度从 O(n)和 O(|E|)降低到 O(n)。随后的聚类步骤可以很容易地设计为 O(n)操作。有趣的是,锚点在马尔可夫过程的帮助下自然地导致了孪生结构。此外,根据 GNN 模块提取的特征,动态更新估计的二分图,通过利用 GNN 的高级信息来提高图的质量。然而,我们从理论上证明了自监督范式经常导致崩溃,尤其是在模型训练良好的情况下,在实验中经常在 2-3 次更新迭代后发生崩溃。因此,设计了一种特定的策略来防止崩溃。实验支持了理论分析,并展示了 AnchorGAE 的优越性。