A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE).

Author information

Bramas Bérenger

Affiliations

CAMUS, Inria Nancy - Grand Est, Nancy, France.

ICPS Team, ICube, Illkirch-Graffenstaden, France.

Publication information

PeerJ Comput Sci. 2021 Nov 19;7:e769. doi: 10.7717/peerj-cs.769. eCollection 2021.

Abstract

The way developers implement their algorithms, and how these implementations behave on modern CPUs, is governed by the design and organization of those CPUs. The vectorization units (SIMD) are among the few CPU components that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically, for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE's interface differs from the x86 extensions in several respects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Using SVE therefore opens new challenges in adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates, how we manage the non-static vector size, and how we efficiently implement the sorting kernels. Our approach only needs an array of O(log N) size for the recursive calls in the partitioning phase, in both the sequential and the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/a74666a379de/peerj-cs-07-769-g001.jpg
