Zhang Yichi, Tang Minh
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1065-1078. doi: 10.1109/TPAMI.2023.3327631. Epub 2024 Jan 10.
Random-walk-based network embedding algorithms like DeepWalk and node2vec are widely used to obtain euclidean representation of the nodes in a network prior to performing downstream inference tasks. However, despite their impressive empirical performance, there is a lack of theoretical results explaining their large-sample behavior. In this paper, we study node2vec and DeepWalk through the perspective of matrix factorization. In particular, we analyze these algorithms in the setting of community detection for stochastic blockmodel graphs (and their degree-corrected variants). By exploiting the row-wise uniform perturbation bound for leading singular vectors, we derive high-probability error bounds between the matrix factorization-based node2vec/DeepWalk embeddings and their true counterparts, uniformly over all node embeddings. Based on strong concentration results, we further show the perfect membership recovery by node2vec/DeepWalk, followed by K-means/medians algorithms. Specifically, as the network becomes sparser, our results guarantee that with large enough window size and vertex number, applying K-means/medians on the matrix factorization-based node2vec embeddings can, with high probability, correctly recover the memberships of all vertices in a network generated from the stochastic blockmodel (or its degree-corrected variants). The theoretical justifications are mirrored in the numerical experiments and real data applications, for both the original node2vec and its matrix factorization variant.
像DeepWalk和node2vec这样基于随机游走的网络嵌入算法被广泛用于在执行下游推理任务之前获得网络中节点的欧几里得表示。然而,尽管它们在实证性能方面令人印象深刻,但缺乏解释其大样本行为的理论结果。在本文中,我们从矩阵分解的角度研究node2vec和DeepWalk。具体而言,我们在随机块模型图(及其度校正变体)的社区检测设置中分析这些算法。通过利用主导奇异向量的行方向均匀扰动界,我们在所有节点嵌入上均匀地推导了基于矩阵分解的node2vec/DeepWalk嵌入与其真实对应物之间的高概率误差界。基于强集中结果,我们进一步展示了node2vec/DeepWalk与K均值/中位数算法一起实现的完美成员恢复。具体来说,随着网络变得更加稀疏,我们的结果保证,在窗口大小和顶点数量足够大的情况下,对基于矩阵分解的node2vec嵌入应用K均值/中位数算法能够以高概率正确恢复由随机块模型(或其度校正变体)生成的网络中所有顶点的成员关系。对于原始的node2vec及其矩阵分解变体,理论依据在数值实验和实际数据应用中都得到了体现。