• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

一种基于ARM可扩展向量扩展(SVE)的快速矢量化排序实现。

A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE).

作者信息

Bramas Bérenger

机构信息

CAMUS, Inria Nancy - Grand Est, Nancy, France.

ICPS Team, ICube, Illkirch-Graffenstaden, France.

出版信息

PeerJ Comput Sci. 2021 Nov 19;7:e769. doi: 10.7717/peerj-cs.769. eCollection 2021.

DOI:10.7717/peerj-cs.769
PMID:34901427
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC8627236/
Abstract

The way developers implement their algorithms and how these implementations behave on modern CPUs are governed by the design and organization of these. The vectorization units (SIMD) are among the few CPUs' parts that can and must be explicitly controlled. In the HPC community, the x86 CPUs and their vectorization instruction sets were de-facto the standard for decades. Each new release of an instruction set was usually a doubling of the vector length coupled with new operations. Each generation was pushing for adapting and improving previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the next years. Second, SVE's interface is different in several aspects from the x86 extensions as it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE opens new challenges on how to adapt algorithms including the ones that are already well-optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach only needs an array of O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.

摘要

开发人员实现其算法的方式以及这些实现在现代CPU上的表现,受其设计和组织的支配。向量化单元(SIMD)是少数几个可以且必须被明确控制的CPU部件。在高性能计算(HPC)领域,几十年来x86 CPU及其向量化指令集实际上一直是标准。指令集的每次新版本发布通常都是向量长度翻倍并伴有新操作。每一代都在推动对先前实现进行调整和改进。ARM可扩展向量扩展(SVE)的发布从几个方面彻底改变了局面。首先,我们预计未来几年ARM处理器将装备许多超级计算机。其次,SVE的接口在几个方面与x86扩展不同,因为它提供不同的指令,使用谓词来控制大多数操作,并且向量大小仅在执行时才知道。因此,使用SVE为如何调整算法带来了新挑战,包括那些在x86上已经优化得很好的算法。在本文中,我们移植了一种基于著名的快速排序和双调排序算法的混合排序。我们使用双调排序来处理小分区/数组,并使用向量化分区实现来划分分区。我们解释了如何使用谓词以及如何管理非静态向量大小。我们还解释了如何高效地实现排序内核。我们的方法在分区阶段的递归调用中,无论是顺序情况还是并行情况,只需要一个O(log N)的数组。我们在现代ARMv8.2(A64FX)CPU上测试了我们方法的性能,并通过对整数、双精度浮点数以及整数键/值对进行排序/分区来评估我们实现的不同层次。我们的结果表明,我们的方法平均比GNU C++排序算法快4倍。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/36a060fbb72d/peerj-cs-07-769-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/a74666a379de/peerj-cs-07-769-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/6b6c30aae8b6/peerj-cs-07-769-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/606705b73040/peerj-cs-07-769-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/ef1c7c01e581/peerj-cs-07-769-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/4ed8ff946264/peerj-cs-07-769-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/61e68b28c45f/peerj-cs-07-769-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/36a060fbb72d/peerj-cs-07-769-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/a74666a379de/peerj-cs-07-769-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/6b6c30aae8b6/peerj-cs-07-769-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/606705b73040/peerj-cs-07-769-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/ef1c7c01e581/peerj-cs-07-769-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/4ed8ff946264/peerj-cs-07-769-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/61e68b28c45f/peerj-cs-07-769-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad1f/8627236/36a060fbb72d/peerj-cs-07-769-g007.jpg

相似文献

1
A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE).一种基于ARM可扩展向量扩展(SVE)的快速矢量化排序实现。
PeerJ Comput Sci. 2021 Nov 19;7:e769. doi: 10.7717/peerj-cs.769. eCollection 2021.
2
Porting and Optimizing BWA-MEM2 Using the Fujitsu A64FX Processor.使用富士通A64FX处理器移植和优化BWA-MEM2
IEEE/ACM Trans Comput Biol Bioinform. 2023 Sep-Oct;20(5):3139-3153. doi: 10.1109/TCBB.2023.3264514. Epub 2023 Oct 9.
3
Parallel Multi-Deque Partition Dual-Deque Merge sorting algorithm using OpenMP.使用 OpenMP 的并行多双端队列分区双端队列归并排序算法。
Sci Rep. 2023 Apr 19;13(1):6408. doi: 10.1038/s41598-023-33583-4.
4
Efficient Scalable Median Filtering Using Histogram-Based Operations.基于直方图操作的高效可扩展中值滤波。
IEEE Trans Image Process. 2018 May;27(5):2217-2228. doi: 10.1109/TIP.2017.2781375.
5
Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions.在具有AVX-512指令的处理器上,使用基于块的内核计算稀疏矩阵向量积且不进行零填充。
PeerJ Comput Sci. 2018 Apr 30;4:e151. doi: 10.7717/peerj-cs.151. eCollection 2018.
6
Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.使用矢量化和多线程在 SeqAn 中进行通用加速序列比对。
Bioinformatics. 2018 Oct 15;34(20):3437-3445. doi: 10.1093/bioinformatics/bty380.
7
SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2.SWPS3 - 适用于IBM Cell/B.E.和x86/SSE2的快速多线程矢量化史密斯-沃特曼算法
BMC Res Notes. 2008 Oct 29;1:107. doi: 10.1186/1756-0500-1-107.
8
Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel.将 SIMD 和 SIMT 架构进行耦合以提高具有系统发育感知的对齐核的性能。
BMC Bioinformatics. 2012 Aug 9;13:196. doi: 10.1186/1471-2105-13-196.
9
Fast Approximations of Activation Functions in Deep Neural Networks when using Posit Arithmetic.当使用正算数时,深度神经网络中激活函数的快速逼近。
Sensors (Basel). 2020 Mar 10;20(5):1515. doi: 10.3390/s20051515.
10
Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER.用于 HMMER 中序列搜索的无高速缓存感知并行 SIMD Viterbi 解码。
BMC Bioinformatics. 2014 May 30;15:165. doi: 10.1186/1471-2105-15-165.