School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.
School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia.
Neural Netw. 2024 Dec;180:106635. doi: 10.1016/j.neunet.2024.106635. Epub 2024 Aug 14.
Graph neural networks (GNNs) have become a popular approach for semi-supervised graph representation learning. GNN research has generally focused on improving methodological details, while less attention has been paid to the importance of which data are labeled. However, for semi-supervised learning, the quality of the training data is vital. In this paper, we first introduce and elaborate on the problem of training data selection for GNNs. More specifically, focusing on node classification, we aim to select representative nodes from a graph as the training set for GNNs so as to achieve the best performance. To solve this problem, we draw inspiration from the popular lottery ticket hypothesis, typically applied to sparse architectures, and propose the following subset hypothesis for graph data: "when selecting a fixed-size training set from the full labeled data, there exists a core subset that represents the properties of the dataset, and GNNs trained on this core subset achieve a better graph representation." Equipped with this subset hypothesis, we present an efficient algorithm to identify the core data in a graph for GNNs. Extensive experiments demonstrate that the selected data, used as a training set, yield performance improvements across various datasets and GNN architectures.
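To make the setting concrete, below is a minimal sketch of the training-data-selection problem the abstract describes: choose a fixed-size "core" subset of labeled nodes and train a GNN on it. The paper's actual selection algorithm is not reproduced here; as a hypothetical stand-in, this sketch scores nodes by degree and keeps the highest-degree nodes per class. It assumes PyTorch Geometric and the Cora benchmark.

import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv
from torch_geometric.utils import degree

dataset = Planetoid(root="data/Cora", name="Cora")
data = dataset[0]

def select_core_subset(data, per_class: int) -> torch.Tensor:
    """Return a boolean training mask keeping `per_class` nodes per class.

    Degree is used purely as an illustrative proxy for "representativeness";
    any node-scoring function (e.g., the paper's algorithm) can be swapped in.
    """
    deg = degree(data.edge_index[0], num_nodes=data.num_nodes)
    mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    for c in range(int(data.y.max()) + 1):
        candidates = (data.y == c).nonzero(as_tuple=True)[0]
        top = candidates[deg[candidates].argsort(descending=True)[:per_class]]
        mask[top] = True
    return mask

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# A fixed labeling budget of 20 nodes per class, the standard Planetoid setup.
train_mask = select_core_subset(data, per_class=20)
model = GCN(dataset.num_features, 16, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[train_mask], data.y[train_mask])
    loss.backward()
    optimizer.step()

model.eval()
pred = model(data.x, data.edge_index).argmax(dim=-1)
acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()
print(f"Test accuracy with degree-selected core subset: {acc:.4f}")

Comparing this mask against a randomly sampled one of the same size illustrates the hypothesis: holding the labeling budget fixed, accuracy depends on which nodes are selected, not only on how many.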