Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan.
PLoS One. 2017 Dec 6;12(12):e0188721. doi: 10.1371/journal.pone.0188721. eCollection 2017.
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special-purpose processing unit called the Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general-purpose use in computers and mobile devices. However, traditional programming languages, which were designed for machines with single-core CPUs, cannot efficiently utilize the parallelism available on multi-core processors. Therefore, to exploit the processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write parallel code. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. As a result, code written in these languages is difficult to understand, debug and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of it in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved depends on the skill of the programmer and the time spent on code optimization. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses a sequential program and transforms it into a parallel program automatically, without any user intervention. This yields substantial speedup and better utilization of the underlying hardware without requiring expertise in parallel programming from the programmer. For five different benchmarks, Rubus achieves an average speedup of 34.54 times over Java on a basic GPU having only 96 cores, while for a matrix multiplication benchmark it achieves an average execution speedup of 84 times on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
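To make the sequential-to-parallel transformation concrete, the following is a minimal sketch in Java of the kind of loop nest an automatic parallelizer targets, using matrix multiplication as in the benchmark mentioned above. It is not Rubus output and does not use a GPU: the class name `MatMulSketch` and both method names are hypothetical, and the parallel variant is hand-written with Java's standard `IntStream.parallel()` on CPU threads purely to show that each output row can be computed independently.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only; not code from or generated by Rubus.
public class MatMulSketch {

    // Sequential baseline: a plain triple loop, the form of code that an
    // automatic parallelizing compiler would take as input.
    public static double[][] multiplySequential(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    // Hand-parallelized variant: iterations of the outer loop write to
    // disjoint rows of c, so they can run concurrently without locking.
    // Assumption: CPU threads via parallel streams stand in for GPU threads.
    public static double[][] multiplyParallel(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                c[i][j] = sum;
            }
        });
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // Both variants should print [[19.0, 22.0], [43.0, 50.0]]
        System.out.println(Arrays.deepToString(multiplySequential(a, b)));
        System.out.println(Arrays.deepToString(multiplyParallel(a, b)));
    }
}
```

The point of the sketch is that the loop body is identical in both variants; only the iteration of the outer loop changes. Detecting that the outer iterations are independent is the analysis step that a compiler such as the one proposed here would have to perform automatically, before mapping the iterations onto GPU cores rather than CPU threads.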