Chi Yuze, Guo Licheng, Lau Jason, Choi Young-Kyu, Wang Jie, Cong Jason
University of California, Los Angeles.
Inha University.
Proc Annu IEEE Symp Field Program Cust Comput Mach. 2021 May;2021. doi: 10.1109/fccm51124.2021.00032. Epub 2021 Jun 2.
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools do support task-parallel programs, productivity is greatly limited ① in the code development cycle due to poor programmability, ② in the correctness verification cycle due to restricted software simulation, and ③ in the QoR tuning cycle due to slow code generation. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, unconstrained software simulation, and fast hierarchical code generation to overcome these limitations and demonstrate how task-parallel programs can be productively supported in HLS. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves programmability. The correctness verification and iterative QoR tuning cycles are shortened by 3.2× and 6.8×, respectively. Our work is open-source at https://github.com/UCLA-VAST/tapa/.