Vitalis Andreas, Winkler Steffen, Zhang Yang, Widmer Julian, Caflisch Amedeo
Department of Biochemistry, University of Zurich, Winterthurerstr. 190, 8057 Zurich, Switzerland.
J Chem Inf Model. 2025 Mar 10;65(5):2443-2455. doi: 10.1021/acs.jcim.4c01301. Epub 2025 Feb 20.
Simulation studies of molecules primarily produce data that represent the configuration of the system as a function of a progress variable, usually time. Because these data are high-dimensional and grow very quickly, compromises are often necessary: only a subset of the system's components is stored (for example, by stripping solvent), and the time resolution is restricted to a scale significantly coarser than the basic time step of the simulation. The resultant trajectories thus describe the essentially stochastic evolution of the molecules of interest. Maintaining the interpretability of these trajectories through metadata is of interest not only because the metadata can aid researchers interested in specific systems but also because they support reproducibility studies and model refinement. Here, we introduce a standard for the storage of data created by molecular simulations that improves compliance with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. We describe a solution conceived in PostgreSQL, along with reference implementations, that provides stringent links between metadata and raw data; the lack of such links is a major weakness of the established file formats used for storing these data. A possible structure for the logic of SQL queries is included along with salient performance tests. To close, we suggest that PostgreSQL-based storage of simulation data, in particular when coupled to a visual user interface, can improve the FAIR compliance of molecular simulation data at all levels of visibility, and we present a prototype solution for accomplishing this.
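The idea of a stringent link between metadata and raw data can be illustrated with a small relational sketch. The example below is not the schema proposed in the paper: the table names (`simulation`, `frame`), the column names, and the helper functions are all hypothetical, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example runs without a server. It shows only the general technique: a foreign key prevents coordinate frames from existing without a metadata record, and a join recovers the physical time of each stored frame from the recorded saving interval.

```python
import sqlite3
import struct


def build_demo_db():
    """Create an in-memory database with hypothetical metadata and frame tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE simulation (
            sim_id            INTEGER PRIMARY KEY,
            system_name       TEXT    NOT NULL,
            time_step_fs      REAL    NOT NULL,  -- basic integration time step
            save_interval_ps  REAL    NOT NULL,  -- coarser interval between stored frames
            solvent_stripped  INTEGER NOT NULL   -- 1 if solvent was removed before storage
        );
        CREATE TABLE frame (
            sim_id       INTEGER NOT NULL REFERENCES simulation(sim_id),
            frame_index  INTEGER NOT NULL,
            coords       BLOB    NOT NULL,       -- packed little-endian doubles
            PRIMARY KEY (sim_id, frame_index)
        );
        """
    )
    return conn


def add_simulation(conn, system_name, time_step_fs, save_interval_ps, solvent_stripped):
    """Insert one metadata record and return its identifier."""
    cur = conn.execute(
        "INSERT INTO simulation (system_name, time_step_fs, save_interval_ps, solvent_stripped) "
        "VALUES (?, ?, ?, ?)",
        (system_name, time_step_fs, save_interval_ps, int(solvent_stripped)),
    )
    return cur.lastrowid


def add_frame(conn, sim_id, frame_index, coords):
    """Store one frame; fails if sim_id has no metadata record."""
    blob = struct.pack(f"<{len(coords)}d", *coords)
    conn.execute(
        "INSERT INTO frame (sim_id, frame_index, coords) VALUES (?, ?, ?)",
        (sim_id, frame_index, blob),
    )


def frames_with_time(conn, sim_id):
    """Return (frame_index, time_ps, n_values) using the linked metadata."""
    return conn.execute(
        """
        SELECT f.frame_index,
               f.frame_index * s.save_interval_ps AS time_ps,
               length(f.coords) / 8 AS n_values
        FROM frame AS f
        JOIN simulation AS s ON s.sim_id = f.sim_id
        WHERE f.sim_id = ?
        ORDER BY f.frame_index
        """,
        (sim_id,),
    ).fetchall()
```

In this sketch, calling `add_frame` with a `sim_id` that has no row in `simulation` raises `sqlite3.IntegrityError`, which is the kind of guarantee that a trajectory file accompanied by a separate metadata file cannot offer. A PostgreSQL version would express the same constraint with `REFERENCES`, typically using `BYTEA` or array columns for the coordinates.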