Jamasb Arian R, Morehead Alex, Joshi Chaitanya K, Zhang Zuobai, Didi Kieran, Mathis Simon, Harris Charles, Tang Jian, Cheng Jianlin, Liò Pietro, Blundell Tom L
University of Cambridge.
University of Missouri.
ArXiv. 2024 Jun 19:arXiv:2406.13864v1.
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improves the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent than invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
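As background for the invariant/equivariant distinction the abstract draws: rotation-invariant GNNs consume geometric features that are unchanged by rigid rotations (e.g. pairwise distances), while equivariant GNNs propagate features (e.g. coordinates or displacement vectors) that transform along with the input. A minimal NumPy sketch of this distinction, using toy coordinates that are not part of the benchmark code:

```python
import numpy as np

# Toy C-alpha coordinates for a 4-residue fragment (angstroms); illustrative only.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0],
                   [0.0, 3.8, 3.8]])

def pairwise_distances(x):
    """All-pairs Euclidean distance matrix."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# A rigid rotation about the z-axis by 30 degrees.
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])

rotated = coords @ rot.T

# Invariant feature: the distance matrix is identical before and after rotation.
assert np.allclose(pairwise_distances(coords), pairwise_distances(rotated))

# Equivariant feature: the raw coordinates are NOT invariant -- they
# transform with the rotation, which equivariant GNN layers must respect.
assert not np.allclose(coords, rotated)
```

Invariant models see only the (unchanged) distance matrix, so rotations are free; equivariant models keep the richer coordinate information, which is part of why the paper finds them more expressive.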