Airas Justin, Zhang Bin
Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139-4307, United States.
J Chem Theory Comput. 2025 Feb 25;21(4):2055-2066. doi: 10.1021/acs.jctc.4c01420. Epub 2025 Feb 6.
Graph neural network (GNN) architectures have emerged as promising force field models, exhibiting high accuracy in predicting complex energies and forces based on atomic identities and Cartesian coordinates. To expand the applicability of GNNs, and machine learning force fields more broadly, optimizing their computational efficiency is critical, especially for large biomolecular systems in classical molecular dynamics simulations. In this study, we address key challenges in existing GNN benchmarks by introducing a dataset, DISPEF, which comprises large, biologically relevant proteins. DISPEF includes 207,454 proteins with sizes up to 12,499 atoms and features diverse chemical environments, spanning folded and disordered regions. The implicit solvation free energies, used as training targets, represent a particularly challenging case due to their many-body nature, providing a stringent test for evaluating the expressiveness of machine learning models. We benchmark the performance of seven GNNs on DISPEF, emphasizing the importance of directly accounting for long-range interactions to enhance model transferability. Additionally, we present a novel multiscale architecture, termed Schake, which delivers transferable and computationally efficient energy and force predictions for large proteins. Our findings offer valuable insights and tools for advancing GNNs in protein modeling applications.