Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, Barcelona, 08003, Spain.
Biophysics Institute, National Research Council (CNR-IBF), Via Celoria 26, Milan, 20133, Italy.
Sci Data. 2024 Nov 28;11(1):1299. doi: 10.1038/s41597-024-04140-z.
Recent advancements in protein structure determination are revolutionizing our understanding of proteins. Still, a significant gap remains in the availability of comprehensive datasets that focus on the dynamics of proteins, which are crucial for understanding protein function, folding, and interactions. To address this critical gap, we introduce mdCATH, a dataset generated through an extensive set of all-atom molecular dynamics simulations of a diverse and representative collection of protein domains. This dataset comprises all-atom systems for 5,398 domains, modeled with a state-of-the-art classical force field, and simulated in five replicates at each of five temperatures from 320 K to 450 K. The mdCATH dataset records coordinates and forces every 1 ns, for over 62 ms of accumulated simulation time, effectively capturing the dynamics of the various classes of domains and providing a unique resource for proteome-wide statistical analyses of protein unfolding thermodynamics and kinetics. We outline the dataset structure and showcase its potential through four easily reproducible case studies, highlighting its capabilities in advancing protein science.
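The figures quoted in the abstract imply a rough scale for the dataset, which the following minimal sketch works out. It assumes that every domain was simulated in every replicate at every temperature, and it treats the "over 62 ms" total as exactly 62 ms, so the per-trajectory average and frame count are lower-bound estimates rather than reported values. The variable names are illustrative and are not part of the mdCATH dataset or its tooling.

```python
# Back-of-the-envelope scale estimate from the numbers in the abstract.
# Assumption: every domain has all replicate/temperature combinations.

N_DOMAINS = 5398          # protein domains in the dataset
N_REPLICATES = 5          # independent replicates per temperature
N_TEMPERATURES = 5        # temperatures between 320 K and 450 K
TOTAL_NS = 62_000_000     # 62 ms expressed in ns ("over 62 ms", so a lower bound)
SAVE_INTERVAL_NS = 1      # coordinates and forces recorded every 1 ns

# Number of independent trajectories under the full-grid assumption.
n_trajectories = N_DOMAINS * N_REPLICATES * N_TEMPERATURES

# Lower bound on the mean trajectory length, in ns.
avg_ns_per_traj = TOTAL_NS / n_trajectories

# Lower bound on the number of recorded frames across the whole dataset.
min_frames = TOTAL_NS // SAVE_INTERVAL_NS

print(f"trajectories:            {n_trajectories}")
print(f"mean length (ns, >=):    {avg_ns_per_traj:.1f}")
print(f"recorded frames (>=):    {min_frames}")
```

Under these assumptions the dataset holds 134,950 trajectories averaging at least roughly 459 ns each, with a minimum of 62 million recorded frames.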