Ruiz-Rohena Kristalys, Rodriguez-Martínez Manuel
Department of Electrical and Computer Engineering, University of Puerto Rico, Mayagüez.
Department of Computer Science and Engineering, University of Puerto Rico, Mayagüez.
IEEE Int Conf Cloud Comput. 2024 Jul;2024:42-53. doi: 10.1109/cloud62652.2024.00015. Epub 2024 Aug 28.
Modern enterprises rely on data management systems to collect, store, and analyze vast amounts of data related to their operations. Clusters and hardware accelerators (e.g., GPUs, TPUs) have become a necessity for keeping pace with the data processing demands of many applications in social media, bioinformatics, surveillance systems, remote sensing, and medical informatics. Given this new scenario, the architecture of data analytics engines must evolve to take advantage of these technological trends. In this paper, we present ArcaDB: a disaggregated query engine that leverages container technology to place operators at compute nodes that fit their performance profile. In ArcaDB, a query plan is dispatched to worker nodes that have different computing characteristics. Each operator is annotated with the preferred type of compute node for execution, and ArcaDB ensures that the operator gets picked up by the appropriate workers. We have implemented a prototype version of ArcaDB using Java, Python, and Docker containers. We have also completed a preliminary performance study of this prototype, using images and scientific data. This study shows that ArcaDB can speed up query performance by a factor of 3.5 in comparison with a shared-nothing, symmetric arrangement.
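The operator-placement idea in the abstract can be illustrated with a small Python sketch. It is not ArcaDB's implementation: the names `Operator`, `Worker`, `Dispatcher`, and `run_plan`, the node-type labels "cpu" and "gpu", and the one-queue-per-node-type design are all assumptions made for illustration. The sketch also ignores data dependencies between operators and only shows how an annotation can steer each operator to a matching worker.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Operator:
    name: str
    preferred_node: str  # hypothetical annotation, e.g. "cpu" or "gpu"


@dataclass
class Worker:
    worker_id: str
    node_type: str  # the kind of compute node this worker runs on
    executed: list = field(default_factory=list)


class Dispatcher:
    """Keeps one pending-operator queue per node type (an assumed design)."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def submit(self, plan):
        # Route every operator to the queue named by its annotation.
        for op in plan:
            self.queues[op.preferred_node].append(op)

    def pull(self, worker):
        # A worker only ever sees operators annotated with its own node type.
        queue = self.queues[worker.node_type]
        if not queue:
            return None
        op = queue.popleft()
        worker.executed.append(op.name)
        return op


def run_plan(plan, workers):
    """Let workers pull in round-robin order until no operator is left for them."""
    dispatcher = Dispatcher()
    dispatcher.submit(plan)
    progress = True
    while progress:
        progress = False
        for worker in workers:
            if dispatcher.pull(worker) is not None:
                progress = True
    return {w.worker_id: w.executed for w in workers}


if __name__ == "__main__":
    plan = [
        Operator("scan", "cpu"),
        Operator("classify_image", "gpu"),
        Operator("filter", "cpu"),
        Operator("aggregate", "cpu"),
    ]
    workers = [Worker("cpu-1", "cpu"), Worker("gpu-1", "gpu")]
    print(run_plan(plan, workers))
```

In this sketch the workers pull work rather than having it pushed to them, which matches the abstract's wording that an operator "gets picked up" by appropriate workers. How ArcaDB actually queues, schedules, and orders operators is not stated in the abstract.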