IEEE Trans Biomed Circuits Syst. 2021 Apr;15(2):259-269. doi: 10.1109/TBCAS.2021.3064841. Epub 2021 May 25.
Owing to the expressiveness of movement and the privacy protection that human skeleton data offers, 3D skeleton-based action inference is becoming popular in healthcare applications. These scenarios demand both better-performing application-specific algorithms and efficient hardware support. Warnings for health emergencies, which are sensitive to response speed, require low-latency output and early action detection. Medical monitoring on an always-on edge platform requires a processor with extreme energy efficiency. To meet these demands, we propose MC-LSTM, a functional and versatile 3D skeleton-based action detection system. Our system achieves state-of-the-art accuracy on trimmed and untrimmed cases of general-purpose and medical-specific datasets, with early-detection features. Further, the MC-LSTM accelerator supports parallel inference on up to 64 input channels. The implementation on a Xilinx ZCU104 reaches a throughput of 18,658 frames per second (FPS) and an inference latency of 3.5 ms at a batch size of 64. The power consumption is 3.6 W for the whole FPGA+ARM system, making it 37.8x and 10.4x more energy-efficient than a high-end Titan X GPU and an i7-9700 CPU, respectively. Our accelerator also keeps a 4x to 5x energy-efficiency advantage over the low-power, high-performance Firefly-RK3399 board, which carries an ARM Cortex-A72+A53 CPU. We further synthesize an 8-bit quantized version on the same hardware, providing a 48.8% increase in energy efficiency at the same throughput.
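The reported figures can be related to one another with a little arithmetic. The sketch below is illustrative only and rests on assumptions the abstract does not state: that energy efficiency is measured as frames per second per watt, and that the 48.8% gain of the 8-bit version at equal throughput comes entirely from lower power. The function and variable names are hypothetical.

```python
# Figures quoted in the abstract.
THROUGHPUT_FPS = 18_658   # frames per second at batch size 64
POWER_W = 3.6             # whole FPGA+ARM system
LATENCY_S = 3.5e-3        # inference latency at batch size 64
BATCH_SIZE = 64


def fps_per_watt(fps: float, watts: float) -> float:
    """Energy efficiency, ASSUMED here to mean frames per second per watt."""
    return fps / watts


def energy_per_frame_mj(fps: float, watts: float) -> float:
    """Energy spent per processed frame, in millijoules."""
    return watts / fps * 1e3


efficiency = fps_per_watt(THROUGHPUT_FPS, POWER_W)
frame_energy_mj = energy_per_frame_mj(THROUGHPUT_FPS, POWER_W)

# Cross-check: one batch of 64 frames per 3.5 ms implies a throughput
# close to the reported one (the quoted latency is rounded).
fps_from_latency = BATCH_SIZE / LATENCY_S

# ASSUMPTION: at equal throughput, a 48.8% efficiency gain implies the
# 8-bit version draws POWER_W / 1.488. The abstract does not report this.
implied_int8_power_w = POWER_W / 1.488

print(f"efficiency: {efficiency:.1f} FPS/W")
print(f"energy per frame: {frame_energy_mj:.3f} mJ")
print(f"throughput implied by latency: {fps_from_latency:.0f} FPS")
print(f"implied 8-bit power: {implied_int8_power_w:.2f} W")
```

Under these assumptions the throughput implied by the latency lands within a few percent of the reported throughput, which is consistent with the 3.5 ms figure being rounded.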