Department of Statistics, Purdue University, West Lafayette, IN 47907, United States of America.
Neural Netw. 2024 Nov;179:106512. doi: 10.1016/j.neunet.2024.106512. Epub 2024 Jul 11.
Network embedding is a general-purpose machine learning technique that converts network data from non-Euclidean space to Euclidean space, facilitating downstream analyses for the networks. However, existing embedding methods are often optimization-based, with the embedding dimension determined in a heuristic or ad hoc way, which can cause potential bias in downstream statistical inference. Additionally, existing deep embedding methods can suffer from a nonidentifiability issue due to the universal approximation power of deep neural networks. We address these issues within a rigorous statistical framework. We treat the embedding vectors as missing data, reconstruct the network features using a sparse decoder, and simultaneously impute the embedding vectors and train the sparse decoder using an adaptive stochastic gradient Markov chain Monte Carlo (MCMC) algorithm. Under mild conditions, we show that the sparse decoder provides a parsimonious mapping from the embedding space to network features, enabling effective selection of the embedding dimension and overcoming the nonidentifiability issue encountered by existing deep embedding methods. Furthermore, we show that the embedding vectors converge weakly to a desired posterior distribution in the 2-Wasserstein distance, addressing the potential bias issue experienced by existing embedding methods. This work lays down the first theoretical foundation for network embedding within the framework of missing data imputation.
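The alternating scheme described in the abstract (impute the embedding vectors, then update a sparse decoder) can be illustrated with a small sketch. This is not the authors' algorithm or code. It assumes a linear decoder in place of a sparse deep neural network, a standard normal prior on the embeddings, a Gaussian reconstruction likelihood, and a row-wise group soft-threshold as the sparsity mechanism. The names `impute_and_train` and `group_soft_threshold`, and every hyperparameter value, are hypothetical choices made for illustration.

```python
import numpy as np


def group_soft_threshold(W, tau):
    """Shrink each row of W toward zero by tau in Euclidean norm.

    A row shrunk to exactly zero means the decoder does not use the
    corresponding embedding dimension (assumed sparsity mechanism).
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale


def impute_and_train(X, d, n_iter=2000, batch=20, eps=0.02,
                     gamma0=0.2, lam=0.1, sigma2=1.0, seed=0):
    """Toy alternation of Langevin imputation and sparse decoder updates.

    X : (n, p) array of network features, one row per node.
    d : working embedding dimension (an upper bound on the useful one).
    Returns the last imputed embeddings Z (n, d) and decoder W (d, p).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = rng.standard_normal((n, d))  # embeddings treated as missing data
    W = np.zeros((d, p))             # linear decoder, a stand-in assumption

    for t in range(n_iter):
        idx = rng.choice(n, size=batch, replace=False)  # mini-batch of nodes
        Zb, Xb = Z[idx], X[idx]

        # Imputation step: one Langevin move targeting p(z | x, W),
        # with Gaussian likelihood (variance sigma2) and N(0, I) prior.
        grad_z = (Xb - Zb @ W) @ W.T / sigma2 - Zb
        Zb = Zb + 0.5 * eps * grad_z + np.sqrt(eps) * rng.standard_normal(Zb.shape)
        Z[idx] = Zb

        # Parameter step: decaying step size, gradient ascent on the
        # mini-batch log-likelihood, then a proximal shrinkage of the rows.
        gamma = gamma0 / (1.0 + t / 50.0) ** 0.6
        grad_w = Zb.T @ (Xb - Zb @ W) / (batch * sigma2)
        W = group_soft_threshold(W + gamma * grad_w, gamma * lam)

    return Z, W
```

In this sketch, rows of `W` with zero norm would mark embedding dimensions the decoder ignores, which is one simple way to read off a data-driven embedding dimension. Whether a given row is driven exactly to zero depends on `lam` and the data, so the sketch should be treated as a qualitative illustration only.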