Kumar Anand, Grupcev Vladimir, Berrada Meryem, Fogarty Joseph C, Tu Yi-Cheng, Zhu Xingquan, Pandit Sagar A, Xia Yuni
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, 33620 Florida USA.
Department of Physics, University of South Florida, 4202 E. Fowler Ave., PHY114, Tampa, 33620 Florida USA.
J Big Data. 2015;2(1):9. doi: 10.1186/s40537-014-0009-5. Epub 2014 Nov 26.
Molecular simulation (MS) is a powerful tool for studying the physical and chemical features of large systems and has been applied in many scientific and engineering domains. During a simulation, experiments generate data on a very large number of atoms, with the aim of observing their spatial and temporal relationships for scientific analysis. The sheer data volume and the intensive interactions among atoms pose significant challenges for data access, management, and analysis. Existing MS software systems fall short in storing and handling MS data, mainly because they lack a platform that supports applications involving intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system that our team has developed over the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) in order to take advantage of the declarative query interface (e.g., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is handling analytical queries, which are often compute-intensive. To address it, we developed novel indexing and query processing strategies (including algorithms that run on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload their data and analyze it using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may interest other users, so that those results are readily available without repeating the analysis. We have developed a prototype of DCMS based on PostgreSQL, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We have also used it as a platform to study other data management issues, such as security and compression.
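To make the database-centric idea concrete, the following is a minimal sketch of how atom coordinates might be stored in a relational table and analyzed with a declarative query. It is illustrative only and does not reproduce the DCMS schema or its query processing. The table name `atoms`, its columns, the sample coordinates, and the helper functions `build_demo_db` and `pairs_within` are all hypothetical. Python's built-in `sqlite3` module stands in for PostgreSQL so that the example runs without a database server.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical per-frame atom table."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per atom per simulation frame.
    conn.execute(
        "CREATE TABLE atoms ("
        " frame INTEGER, atom_id INTEGER,"
        " x REAL, y REAL, z REAL,"
        " PRIMARY KEY (frame, atom_id))"
    )
    # Made-up coordinates for three atoms over two frames.
    rows = [
        (0, 1, 0.0, 0.0, 0.0),
        (0, 2, 1.0, 0.0, 0.0),
        (0, 3, 5.0, 5.0, 5.0),
        (1, 1, 0.0, 0.0, 0.0),
        (1, 2, 4.0, 0.0, 0.0),
        (1, 3, 5.0, 5.0, 5.0),
    ]
    conn.executemany("INSERT INTO atoms VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def pairs_within(conn, cutoff):
    """Count, per frame, the atom pairs whose distance is at most `cutoff`.

    The analysis is a single declarative self-join; frames with no
    qualifying pair do not appear in the result.
    """
    sql = """
        SELECT a.frame, COUNT(*)
        FROM atoms AS a
        JOIN atoms AS b
          ON a.frame = b.frame AND a.atom_id < b.atom_id
        WHERE (a.x - b.x) * (a.x - b.x)
            + (a.y - b.y) * (a.y - b.y)
            + (a.z - b.z) * (a.z - b.z) <= ?
        GROUP BY a.frame
        ORDER BY a.frame
    """
    # Compare squared distances so the query needs no square-root function.
    return conn.execute(sql, (cutoff * cutoff,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(pairs_within(conn, 2.0))
    print(pairs_within(conn, 4.5))
```

A plain self-join like this compares every pair of atoms within a frame, which is the kind of compute-intensive analytical query the abstract identifies as a challenge. The specialized indexing and co-processor strategies the abstract mentions are not modeled in this sketch.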