Zhejiang University, Hangzhou, China.
Neural Netw. 2023 Aug;165:987-998. doi: 10.1016/j.neunet.2023.06.026. Epub 2023 Jun 28.
Current distributed graph training frameworks evenly partition a large graph into small chunks to suit distributed storage, access neighbors through a uniform interface, and train graph neural networks across a cluster of machines to update weights. However, they design storage and training separately, which incurs heavy communication costs for retrieving remote neighborhoods. In the storage phase, traditional heuristic graph partitioning not only suffers high memory overhead, since the full graph must be loaded into memory, but also breaks semantically related structures, since it ignores meaningful node attributes. Moreover, in the weight-update phase, direct averaging synchronization copes poorly with heterogeneous local models, because each machine's data are loaded from a different subgraph, resulting in slow convergence. To solve these problems, we propose a novel distributed graph training approach, attribute-driven streaming edge partitioning with reconciliations (ASEPR), in which each local model loads only the subgraph stored on its own machine, reducing communication. ASEPR first clusters nodes with similar attributes into the same partition to preserve semantic structure and multihop neighbor locality. Streaming partitioning combined with attribute clustering is then applied to subgraph assignment to alleviate memory overhead. After local graph neural network training on the distributed machines, we deploy cross-layer reconciliation strategies for the heterogeneous local models, improving the averaged global model through knowledge distillation and contrastive learning. Extensive experiments on four large graph datasets, covering node classification and link prediction tasks, show that our model outperforms DistDGL with fewer resource requirements and up to four times faster convergence.
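The attribute-driven streaming assignment summarized above can be illustrated with a minimal greedy heuristic: each arriving node joins the non-full partition whose running attribute centroid is closest, so semantically similar nodes tend to be co-located while a capacity cap keeps the load balanced. This is a sketch of the general idea only, not the paper's exact ASEPR algorithm; the function name and interface are hypothetical.

```python
import math

def assign_stream(nodes, k, capacity):
    """Greedy attribute-driven streaming partitioner (illustrative sketch).

    nodes: iterable of (node_id, attribute_vector) pairs, seen one at a time.
    k: number of partitions; capacity: max nodes per partition.
    Assumes len(nodes) <= k * capacity.
    """
    centroids = [None] * k   # running mean of attributes per partition
    sizes = [0] * k
    assignment = {}
    for node_id, attr in nodes:
        open_parts = [p for p in range(k) if sizes[p] < capacity]
        nonempty = [p for p in open_parts if sizes[p] > 0]
        if nonempty:
            # closest non-full partition by attribute distance
            best = min(nonempty, key=lambda p: math.dist(attr, centroids[p]))
        else:
            # seed the first empty partition
            best = open_parts[0]
        assignment[node_id] = best
        sizes[best] += 1
        if centroids[best] is None:
            centroids[best] = list(attr)
        else:
            # incremental running-mean update of the centroid
            n = sizes[best]
            centroids[best] = [c + (a - c) / n
                               for c, a in zip(centroids[best], attr)]
    return assignment
```

Because the centroid is updated incrementally, the stream never requires the full graph in memory, which is the memory-overhead advantage the abstract attributes to streaming partitioning.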
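The reconciliation step builds on two standard ingredients that can be sketched in isolation: plain parameter averaging (the baseline the paper improves upon) and a temperature-softened knowledge-distillation loss. The sketch below shows only these generic building blocks; the paper's actual cross-layer reconciliation and contrastive-learning losses are not reproduced, and both function names are hypothetical.

```python
import math

def average_weights(local_weights):
    """Element-wise mean of each machine's flattened weight vector --
    the direct averaging synchronization that struggles with
    heterogeneous local models."""
    n = len(local_weights)
    return [sum(ws) / n for ws in zip(*local_weights)]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions: the standard knowledge-distillation objective."""
    def softmax(zs, t):
        m = max(zs)
        exps = [math.exp((z - m) / t) for z in zs]
        s = sum(exps)
        return [e / s for e in exps]
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student already matches the teacher and grows as their predictions diverge, which is what lets a distillation term pull heterogeneous local models toward a consistent global one after averaging.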