Janson Giacomo, Feig Michael
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA.
bioRxiv. 2024 Feb 8:2024.02.08.579522. doi: 10.1101/2024.02.08.579522.
Intrinsically disordered proteins have dynamic structures through which they play key biological roles. The elucidation of their conformational ensembles is a challenging problem requiring an integrated use of computational and experimental methods. Molecular simulations are a valuable computational strategy for constructing structural ensembles of disordered proteins but are highly resource-intensive. Recently, machine learning approaches based on deep generative models that learn from simulation data have emerged as an efficient alternative for generating structural ensembles. However, such methods currently suffer from limited transferability when modeling sequences and conformations absent from the training data. Here, we develop a novel generative model that achieves high levels of transferability for intrinsically disordered protein ensembles. The approach, named idpSAM, is a latent diffusion model based on transformer neural networks. It combines an autoencoder that learns a representation of protein geometry with a diffusion model that samples novel conformations in the encoded space. IdpSAM was trained on a large dataset of simulations of disordered protein regions performed with the ABSINTH implicit solvent model. Thanks to the expressiveness of its neural networks and its training stability, idpSAM faithfully captures 3D structural ensembles of test sequences that have no similarity to sequences in the training set. Our study also demonstrates the potential for generating full conformational ensembles from datasets with limited sampling and underscores the importance of training set size for generalization. We believe that idpSAM represents significant progress in transferable protein ensemble modeling through machine learning.
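The sampling half of a latent diffusion model, as described above, can be illustrated with a minimal sketch. This is not the idpSAM implementation: the abstract does not specify the sampler, so the DDPM-style ancestral sampling, the linear noise schedule, and every name here (`make_schedule`, `placeholder_noise_predictor`, `sample_latent`) are assumptions made for illustration. The noise predictor stands in for a trained, sequence-conditioned transformer, and a trained decoder (not shown) would be needed to map the sampled per-residue latents back to 3D coordinates.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice): betas and cumulative alpha products."""
    betas = [
        beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        for i in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def placeholder_noise_predictor(z_t, t, sequence):
    """Hypothetical stand-in for a trained network; it always predicts zero noise."""
    return [[0.0] * len(row) for row in z_t]


def sample_latent(noise_predictor, sequence, latent_dim=16, num_steps=50, rng=None):
    """Draw one per-residue latent encoding by reversing the diffusion process."""
    rng = rng or random.Random(0)
    betas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise: one latent vector per residue.
    z = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in sequence]
    for t in reversed(range(num_steps)):
        eps_hat = noise_predictor(z, t, sequence)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        # No fresh noise is added at the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        z = [
            [
                scale * (z_ij - coef * e_ij) + sigma * rng.gauss(0.0, 1.0)
                for z_ij, e_ij in zip(z_row, e_row)
            ]
            for z_row, e_row in zip(z, eps_hat)
        ]
    return z


# Example: one latent sample for a made-up 9-residue sequence.
latent = sample_latent(placeholder_noise_predictor, "MKVLAAGIT", latent_dim=4, num_steps=10)
print(len(latent), len(latent[0]))
```

Calling `sample_latent` repeatedly with different random generators would yield an ensemble of latent encodings for the same sequence. In a real model, the quality of that ensemble depends entirely on the trained noise predictor and decoder, which this sketch does not provide.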