Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA.
Nat Commun. 2023 Feb 11;14(1):774. doi: 10.1038/s41467-023-36443-x.
Dynamics and conformational sampling are essential for linking protein structure to biological function. Because protein dynamics are challenging to probe experimentally, computer simulations are widely used to describe them, but at significant computational costs that continue to limit the systems that can be studied. Here, we demonstrate that machine learning models can be trained with simulation data to directly generate physically realistic conformational ensembles of proteins without the need for any sampling and at negligible computational cost. As a proof of principle, we train a generative adversarial network based on a transformer architecture with self-attention on coarse-grained simulations of intrinsically disordered peptides. The resulting model, idpGAN, can predict sequence-dependent coarse-grained ensembles for sequences that are not present in the training set, demonstrating that transferability can be achieved beyond the limited training data. We also retrain idpGAN on atomistic simulation data to show that the approach can be extended in principle to higher-resolution conformational ensemble generation.
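To make the idea of generating an ensemble "without sampling" concrete, the following is a minimal, untrained sketch of a generator that maps an amino acid sequence plus random latent vectors to per-residue 3D coordinates through a single self-attention layer. It is not the idpGAN implementation: the class `ToyGenerator`, the function `generate_ensemble`, the layer sizes, the use of one latent vector per residue, and the random fixed weights are all assumptions made for illustration, and positional encoding and adversarial training are omitted for brevity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Random weights standing in for trained parameters (assumption)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyGenerator:
    """Hypothetical, untrained generator: (sequence, latent noise) -> coordinates."""

    def __init__(self, latent_dim=8, model_dim=16, seed=0):
        rng = random.Random(seed)
        in_dim = len(AMINO_ACIDS) + latent_dim
        self.latent_dim = latent_dim
        self.model_dim = model_dim
        self.w_in = random_matrix(rng, in_dim, model_dim)
        self.w_q = random_matrix(rng, model_dim, model_dim)
        self.w_k = random_matrix(rng, model_dim, model_dim)
        self.w_v = random_matrix(rng, model_dim, model_dim)
        self.w_out = random_matrix(rng, model_dim, 3)

    def generate(self, sequence, rng):
        """Return one conformation as a list of [x, y, z], one per residue."""
        # Per-residue input: one-hot residue identity concatenated with noise.
        feats = []
        for aa in sequence:
            one_hot = [1.0 if aa == a else 0.0 for a in AMINO_ACIDS]
            z = [rng.gauss(0.0, 1.0) for _ in range(self.latent_dim)]
            feats.append(one_hot + z)
        h = matmul(feats, self.w_in)

        # Single-head scaled dot-product self-attention over all residues.
        q, k, v = matmul(h, self.w_q), matmul(h, self.w_k), matmul(h, self.w_v)
        scale = math.sqrt(self.model_dim)
        scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
        attn = [softmax(row) for row in scores]
        mixed = matmul(attn, v)

        # Residual connection, then a linear projection to 3D coordinates.
        h = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(h, mixed)]
        return matmul(h, self.w_out)


def generate_ensemble(generator, sequence, n_conformations, seed=0):
    """Draw independent conformations: one forward pass per latent draw."""
    rng = random.Random(seed)
    return [generator.generate(sequence, rng) for _ in range(n_conformations)]


ensemble = generate_ensemble(ToyGenerator(), "GSAGS", n_conformations=4)
```

Each conformation here comes from one independent forward pass, so ensemble members are uncorrelated by construction, unlike successive frames of a simulation trajectory. With random weights the coordinates are physically meaningless; in a trained model of this kind, the weights would be fitted so that generated structures become statistically indistinguishable from simulation snapshots.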