Vander Meersche Yann, Duval Gabriel, Cretin Gabriel, Gheeraert Aria, Gelly Jean-Christophe, Galochkina Tatiana
Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, Paris, France.
Protein Sci. 2025 Aug;34(8):e70221. doi: 10.1002/pro.70221.
A protein's flexibility is essential to its biological function. However, experimental methods for assessing flexibility, such as X-ray crystallography and nuclear magnetic resonance spectroscopy, are often limited by experimental variability and high cost, leaving a gap between the number of identified protein sequences and the available experimental information on protein dynamics. Molecular dynamics (MD) simulations, by contrast, provide a uniform and detailed description of the expected protein flexibility, and the availability and quality of such data have increased significantly in recent years. In this study, we use the recently released ATLAS database to develop ProtEin lanGuAge models for prediction of SimUlated dynamicS (PEGASUS), a sequence-based predictor of MD-derived information on protein flexibility (https://dsimb.inserm.fr/PEGASUS). PEGASUS integrates four different representations of protein sequences generated by Protein Language Models to predict three kinds of residue-wise MD-derived values: backbone fluctuation (root mean square fluctuation, RMSF), the standard deviations of the Phi and Psi dihedral angles, and the average Local Distance Difference Test (LDDT) score across the trajectory. The PEGASUS web server is optimized to return instantaneous predictions for an individual protein sequence and also allows batch submission of up to 100 sequences of up to 1,000 residues each. For more complex queries, we also release PEGASUS as a user-friendly standalone utility (https://github.com/DSIMB/PEGASUS).
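To make two of the predicted quantities concrete, the following is a minimal sketch of how a per-residue RMSF and a dihedral-angle standard deviation could be computed from an MD trajectory. It is not the authors' pipeline: the function names `rmsf` and `circular_std` are hypothetical, the frames are assumed to be already superimposed on a common reference, one representative atom per residue is assumed, and the use of a circular (rather than linear) standard deviation for Phi and Psi is an assumption, since the abstract does not state which definition was used.

```python
import math


def rmsf(frames):
    """Per-residue root mean square fluctuation.

    frames: list of frames; each frame is a list of (x, y, z) tuples,
    one per residue. Frames are assumed to be pre-aligned (assumption).
    """
    n_frames = len(frames)
    n_res = len(frames[0])
    result = []
    for i in range(n_res):
        # Mean position of residue i over all frames
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation from the mean position
        msd = sum(
            sum((f[i][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


def circular_std(angles_deg):
    """Circular standard deviation, in degrees, of a list of angles.

    Handles the periodic wrap at +/-180 degrees, which a plain
    standard deviation would not (choice of definition is an assumption).
    """
    rad = [math.radians(a) for a in angles_deg]
    c = sum(math.cos(a) for a in rad) / len(rad)
    s = sum(math.sin(a) for a in rad) / len(rad)
    # Mean resultant length, clamped against floating-point overshoot
    r = min(1.0, math.hypot(c, s))
    if r == 0.0:
        return math.inf  # angles spread uniformly: dispersion is unbounded
    return math.degrees(math.sqrt(-2.0 * math.log(r)))


# Toy trajectory: residue 0 is fixed, residue 1 oscillates along x
toy_frames = [
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
toy_rmsf = rmsf(toy_frames)
toy_phi_std = circular_std([179.0, -179.0])
```

In the toy trajectory the fixed residue has zero fluctuation while the oscillating one does not, and the two dihedral values on either side of the ±180° boundary are treated as close together rather than nearly 360° apart.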