Janson Giacomo, Jussupow Alexander, Feig Michael
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA.
bioRxiv. 2025 Mar 13:2025.03.09.642148. doi: 10.1101/2025.03.09.642148.
Deep learning has revolutionized protein structure prediction, but capturing conformational ensembles and structural variability remains an open challenge. While molecular dynamics (MD) is the foundational method for simulating biomolecular dynamics, it is computationally expensive. Recently, deep learning models trained on MD data have made progress in generating structural ensembles at reduced cost. However, they remain limited in modeling atomistic details and, crucially, in incorporating the effects of environmental factors. Here, we present aSAM (atomistic structural autoencoder model), a latent diffusion model trained on MD data to generate heavy-atom protein ensembles. Unlike most methods, aSAM models atoms in a latent space, greatly facilitating accurate sampling of side-chain and backbone torsion-angle distributions. Additionally, we extended aSAM into the first reported transferable generator conditioned on temperature, named aSAMt. Trained on the large and open mdCATH dataset, aSAMt captures temperature-dependent ensemble properties and generalizes beyond its training temperatures. By comparing aSAMt ensembles to long MD simulations of fast-folding proteins, we find that high-temperature training enhances the ability of deep generators to explore energy landscapes. Finally, we show that our MD-based aSAMt can already capture experimentally observed thermal behavior of proteins. Our work is a step towards generalizable ensemble generation to complement physics-based approaches.