Chi Yuze, Cong Jason
University of California, Los Angeles.
Proc Des Autom Conf. 2020 Jul;2020. doi: 10.1109/dac18072.2020.9218680. Epub 2020 Oct 9.
Stencil kernel is an important type of kernel used extensively in many application domains. Over the years, researchers have been studying the optimizations on parallelization, communication reuse, and computation reuse for various target platforms. However, challenges still exist, especially on the computation reuse problem for accelerators, due to the lack of complete design-space exploration and effective design-space pruning. In this paper, we present solutions to the above challenges for a wide range of stencil kernels (i.e., stencil with reduction operations), where the computation reuse patterns are extremely flexible due to the commutative and associative properties. We formally define the complete design space, based on which we present a provably optimal dynamic programming algorithm and a heuristic beam search algorithm that provides near-optimal solutions under an architecture-aware model. Experimental results show that for synthesizing stencil kernels to FPGAs, compared with state-of-the-art stencil compiler without computation reuse capability, our proposed algorithm can reduce the look-up table (LUT) and digital signal processor (DSP) usage by 58.1% and 54.6% on average respectively, which leads to an average speedup of 2.3× for compute-intensive kernels, outperforming the latest CPU/GPU results.
模板内核是一种重要的内核类型,在许多应用领域中被广泛使用。多年来,研究人员一直在研究针对各种目标平台的并行化、通信复用和计算复用优化。然而,挑战仍然存在,特别是在加速器的计算复用问题上,因为缺乏完整的设计空间探索和有效的设计空间剪枝。在本文中,我们针对广泛的模板内核(即带有归约操作的模板)提出了上述挑战的解决方案,其中由于交换律和结合律,计算复用模式极其灵活。我们正式定义了完整的设计空间,并在此基础上提出了一种可证明最优的动态规划算法和一种启发式束搜索算法,该算法在架构感知模型下提供接近最优的解决方案。实验结果表明,对于将模板内核综合到现场可编程门阵列(FPGA)中,与没有计算复用能力的最新模板编译器相比,我们提出的算法平均可分别将查找表(LUT)和数字信号处理器(DSP)的使用量减少58.1%和54.6%,这使得计算密集型内核的平均加速比达到2.3倍,优于最新的中央处理器(CPU)/图形处理器(GPU)结果。