Martino Sam Alexander, Morado João, Li Chenghao, Lu Zhenghao, Rosta Edina
Department of Physics and Astronomy, University College London, London WC1E 6BT, U.K.
J Phys Chem B. 2024 Aug 29;128(34):8103-8115. doi: 10.1021/acs.jpcb.3c08213. Epub 2024 Aug 15.
The recent trend in using network and graph structures to represent a variety of different data types has renewed interest in the graph partitioning (GP) problem. This interest stems from the need for general methods that can both efficiently identify network communities and reduce the dimensionality of large graphs while satisfying various application-specific criteria. Traditional clustering algorithms often struggle to capture the complex relationships within graphs and generalize to arbitrary clustering criteria. The emergence of graph neural networks (GNNs) as a powerful framework for learning representations of graph data provides new approaches to solving the problem. Previous work has shown GNNs to be capable of proposing partitionings using a variety of criteria. However, these approaches have not yet been extended to Markov chains or kinetic networks. These arise frequently in the study of molecular systems and are of particular interest to the biomolecular modeling community. In this work, we propose several GNN-based architectures to tackle the GP problem for Markov Chains described as kinetic networks. This approach aims to maximize the Kemeny constant, which is a variational quantity and it represents the sum of time scales of the system. We propose using an encoder-decoder architecture and show how simple GraphSAGE-based GNNs with linear layers can outperform much larger and more expressive attention-based models in this context. As a proof of concept, we first demonstrate the method's ability to cluster randomly connected graphs. We also use a linear chain architecture corresponding to a 1D free energy profile as our kinetic network. Subsequently, we demonstrate the effectiveness of our method through experiments on a data set derived from molecular dynamics. We compare the performance of our method to other partitioning techniques, such as PCCA+. We explore the importance of feature and hyperparameter selection and propose a general strategy for large-scale parallel training of GNNs for discovering optimal graph partitionings.
最近,使用网络和图结构来表示各种不同数据类型的趋势重新激发了人们对图划分(GP)问题的兴趣。这种兴趣源于对通用方法的需求,这些方法既能有效地识别网络社区,又能在满足各种特定应用标准的同时降低大图的维度。传统的聚类算法往往难以捕捉图中的复杂关系,也难以推广到任意的聚类标准。图神经网络(GNN)作为一种强大的图数据表示学习框架的出现,为解决该问题提供了新的方法。先前的工作表明,GNN能够使用各种标准提出划分方法。然而,这些方法尚未扩展到马尔可夫链或动力学网络。马尔可夫链和动力学网络在分子系统研究中经常出现,并且受到生物分子建模社区的特别关注。在这项工作中,我们提出了几种基于GNN的架构,以解决描述为动力学网络的马尔可夫链的GP问题。这种方法旨在最大化凯梅尼常数,它是一个变分量,代表系统的时间尺度之和。我们提出使用编码器 - 解码器架构,并展示了在此背景下,简单的带有线性层的基于GraphSAGE的GNN如何优于更大且更具表现力的基于注意力的模型。作为概念验证,我们首先展示了该方法对随机连接图进行聚类的能力。我们还使用对应于一维自由能分布的线性链架构作为我们的动力学网络。随后,我们通过对来自分子动力学的数据集进行实验来证明我们方法的有效性。我们将我们方法的性能与其他划分技术(如PCCA +)进行比较。我们探讨了特征和超参数选择的重要性,并提出了一种用于大规模并行训练GNN以发现最优图划分的通用策略。