Visani Gian Marco, Pun Michael N, Angaji Arman, Nourmohammad Armita
Paul G. Allen School of Computer Science and Engineering, University of Washington, 85 E Stevens Way NE, Seattle, Washington 98195, USA.
Department of Physics, University of Washington, 3910 15th Avenue Northeast, Seattle, Washington 98195, USA.
Phys Rev Res. 2024 Apr-Jun;6(2). doi: 10.1103/physrevresearch.6.023006. Epub 2024 Apr 1.
Group-equivariant neural networks have emerged as an efficient approach to model complex data, using generalized convolutions that respect the relevant symmetries of a system. These techniques have made advances in both the supervised learning tasks for classification and regression, and the unsupervised tasks to generate new data. However, little work has been done in leveraging the symmetry-aware expressive representations that could be extracted from these approaches. Here, we present the holographic-(variational) autoencoder [H-(V)AE], a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin in 3D. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a low-dimensional representation of the data (i.e., a latent space) with a maximally informative rotationally invariant embedding alongside an equivariant frame describing the orientation of the data. We extensively test the performance of H-(V)AE on diverse datasets. We show that the learned latent space efficiently encodes the categorical features of spherical images. Moreover, the low-dimensional representations learned by H-(V)AE can be used for downstream data-scarce tasks. Specifically, we show that H-(V)AE's latent space can be used to extract compact embeddings for protein structure microenvironments, and when paired with a random forest regressor, it enables state-of-the-art predictions of protein-ligand binding affinity.
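To make the notion of a rotationally invariant quantity in spherical Fourier space concrete, the following is a minimal sketch, not the authors' model. It projects a 3D point cloud centered on the origin onto real spherical harmonics of degree 0 to 2 and checks that the per-degree power is unchanged by a random rotation while the raw coefficients are not. The function names, the Gaussian radial weighting, and the cutoff at degree 2 are illustrative assumptions; the embedding learned by H-(V)AE is a trained representation, of which a power spectrum is only the simplest hand-built analogue.

```python
import numpy as np

def real_sph_harm_l012(unit):
    """Real spherical harmonics of degree 0, 1, 2 at unit vectors (N, 3), grouped by degree."""
    x, y, z = unit[:, 0], unit[:, 1], unit[:, 2]
    l0 = np.stack([0.5 * np.sqrt(1 / np.pi) * np.ones_like(x)], axis=1)
    c1 = np.sqrt(3 / (4 * np.pi))
    l1 = np.stack([c1 * y, c1 * z, c1 * x], axis=1)
    c2 = 0.5 * np.sqrt(15 / np.pi)
    l2 = np.stack([
        c2 * x * y,
        c2 * y * z,
        0.25 * np.sqrt(5 / np.pi) * (3 * z**2 - 1),
        c2 * x * z,
        0.5 * c2 * (x**2 - y**2),
    ], axis=1)
    return [l0, l1, l2]

def fourier_coeffs(points, sigma=1.0):
    """Project a point cloud onto the harmonics above.

    The Gaussian radial weight (hypothetical choice) depends only on the
    distance to the origin, so it is unaffected by rotations.
    """
    r = np.linalg.norm(points, axis=1)
    unit = points / r[:, None]
    radial = np.exp(-0.5 * (r / sigma) ** 2)
    return [(radial[:, None] * Y).sum(axis=0) for Y in real_sph_harm_l012(unit)]

def power_spectrum(coeffs):
    """Sum of squared coefficients within each degree: one invariant per degree."""
    return np.array([np.sum(c ** 2) for c in coeffs])

def random_rotation(rng):
    """Random proper rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis so that det(q) = +1
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(50, 3))
    rotated = points @ random_rotation(rng).T
    print("power, original:", power_spectrum(fourier_coeffs(points)))
    print("power, rotated: ", power_spectrum(fourier_coeffs(rotated)))
```

The degree-1 coefficients transform like an ordinary 3D vector under rotation, which is the simplest example of an equivariant quantity, while their squared norm stays fixed.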