IEEE J Biomed Health Inform. 2022 Oct;26(10):5223-5234. doi: 10.1109/JBHI.2022.3193148. Epub 2022 Oct 4.
The popularity of convolutional architectures has made sensor-based human activity recognition (HAR) one of their primary beneficiaries. By simply stacking multiple convolution layers, local features can be effectively captured from multi-channel time series sensor data, yielding high-performance activity predictions. Meanwhile, recent years have witnessed the great success of the Transformer model, which uses a powerful self-attention mechanism to handle long-range sequence modeling tasks and thereby avoids the limitation of the purely local feature representations produced by convolutional neural networks (CNNs). In this paper, we seek to combine the merits of CNNs and Transformers to model multi-channel time series sensor data, with the aim of providing compelling recognition performance with fewer parameters and FLOPs on lightweight wearable devices. To this end, we propose a new Dual-branch Interactive Network (DIN) that inherits the advantages of both CNNs and Transformers for handling multi-channel time series in HAR. Specifically, the proposed framework uses a two-stream architecture to disentangle local and global features by performing conv-embedding and patch-embedding, and a co-attention mechanism adaptively fuses the global-to-local and local-to-global feature representations. We perform extensive experiments on three mainstream HAR benchmark datasets, PAMAP2, WISDM, and OPPORTUNITY, which verify that our method consistently outperforms several state-of-the-art baselines, reaching F1-scores of 92.05%, 98.17%, and 91.55%, respectively, with fewer parameters and FLOPs. In addition, the practical execution time is validated on an embedded Raspberry Pi P3 system, which demonstrates that our approach is efficient enough for real-time HAR implementations and is a better alternative in ubiquitous HAR computing scenarios. Our model code will be released soon.
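The two-stream design described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' DIN implementation (their code had not been released at the time of the abstract); it is an assumption-laden toy. The function names (`conv_embed`, `patch_embed`, `cross_attention`, `dual_branch_features`), the kernel and patch sizes, and the use of fixed averaging in place of learned convolution and projection weights are all hypothetical. "Global-to-local" is taken here to mean local tokens querying global tokens, and "local-to-global" the reverse, which may differ from the paper's definition.

```python
import math


def conv_embed(window, kernel=3):
    """Local branch: sliding moving average over time, per channel.
    Stand-in (assumption) for a learned 1D convolution."""
    channels = len(window[0])
    return [
        [sum(window[t + k][c] for k in range(kernel)) / kernel for c in range(channels)]
        for t in range(len(window) - kernel + 1)
    ]


def patch_embed(window, patch=4):
    """Global branch: non-overlapping patches, averaged per channel.
    Stand-in (assumption) for a learned linear patch projection."""
    channels = len(window[0])
    tokens = []
    for start in range(0, len(window) - patch + 1, patch):
        chunk = window[start:start + patch]
        tokens.append([sum(row[c] for row in chunk) / patch for c in range(channels)])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query token attends over the
    tokens of the other branch (keys and values are the same tokens here)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(dim)]
        )
    return attended


def mean_pool(tokens):
    """Average a list of tokens into a single feature vector."""
    return [sum(tok[c] for tok in tokens) / len(tokens) for c in range(len(tokens[0]))]


def dual_branch_features(window, kernel=3, patch=4):
    """Fuse both branches with bidirectional cross-attention and residual sums,
    then concatenate the pooled local and global features."""
    local_tokens = conv_embed(window, kernel)
    global_tokens = patch_embed(window, patch)
    global_to_local = cross_attention(local_tokens, global_tokens)
    local_to_global = cross_attention(global_tokens, local_tokens)
    local_fused = [[a + b for a, b in zip(x, y)] for x, y in zip(local_tokens, global_to_local)]
    global_fused = [[a + b for a, b in zip(x, y)] for x, y in zip(global_tokens, local_to_global)]
    return mean_pool(local_fused) + mean_pool(global_fused)


# Synthetic window: 8 time steps, 3 sensor channels (made-up data).
window = [[math.sin(0.5 * t + c) for c in range(3)] for t in range(8)]
features = dual_branch_features(window)
print(len(features))  # 6: three pooled local features plus three pooled global features
```

In a real model the averaging steps would be replaced by trainable layers, and the concatenated vector would feed a classifier over activity labels; the sketch only shows how local and global token streams can exchange information in both directions before fusion.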