Predicting and Comparing the Performance of Array Management Libraries.

Authors

Kang Donghe, Rübel Oliver, Byna Suren, Blanas Spyros

Affiliations

The Ohio State University.

Lawrence Berkeley National Laboratory.

Publication information

Proc IPDPS (Conf). 2020 May;2020:906-915. doi: 10.1109/ipdps47924.2020.00097. Epub 2020 Jul 14.

Abstract

Many applications are increasingly becoming I/O-bound. To improve scalability, analytical models of parallel I/O performance are often consulted to determine possible I/O optimizations. However, I/O performance modeling has predominantly focused on applications that directly issue I/O requests to a parallel file system or a local storage device. These I/O models are not directly usable by applications that access data through standardized I/O libraries, such as HDF5, FITS, and NetCDF, because a single I/O request to an object can trigger a cascade of I/O operations to different storage blocks. The I/O performance characteristics of applications that rely on these libraries are a complex function of the underlying data storage model, user-configurable parameters, and object-level access patterns. As a consequence, I/O optimization is predominantly an ad hoc process performed by application developers, who are often domain scientists with limited desire to delve into the nuances of the storage hierarchy of modern computers. This paper presents an analytical cost model to predict the end-to-end execution time of applications that perform I/O through established array management libraries. The paper focuses on the HDF5 and Zarr array libraries as examples of I/O libraries with radically different storage models: HDF5 stores every object in one file, while Zarr creates multiple files to store different objects. We find that accessing array objects via these I/O libraries introduces new overheads and optimizations. Specifically, in addition to I/O time, it is crucial to model the cost of transforming data to a particular storage layout (memory copy cost), as well as the benefit of accessing a software cache. We evaluate the model on real applications that process observations (neuroscience) and simulation results (plasma physics). The evaluation on three HPC clusters reveals that I/O accounts for as little as 10% of the execution time in some cases, and hence models that focus only on I/O performance cannot accurately capture the performance of applications that use standard array storage libraries. In parallel experiments, our model correctly predicts the faster storage library between HDF5 and Zarr 94% of the time, in contrast with 70% of the time for a cutting-edge I/O model.
