Huang Ying, Zhang Yinhui, He Zifen, Deng Yunnan
Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming 650500, China.
Sensors (Basel). 2025 Jul 9;25(14):4274. doi: 10.3390/s25144274.
Despite the pivotal role of unmanned aerial vehicles (UAVs) in intelligent inspection tasks, existing video instance segmentation (VIS) methods struggle with irregularly deforming targets, producing inconsistent segmentation results because they capture feature offsets and model temporal correlations ineffectively. To address this issue, we propose a hierarchical offset compensation and temporal memory update method for video instance segmentation (HT-VIS) with high generalization ability. First, a hierarchical offset compensation (HOC) module, built in both sequential and parallel connection forms, applies deformable offsets to the same flexible target across frames, compensating for spatial motion features along the temporal sequence. Next, a temporal memory update (TMU) module employs convolutional long short-term memory (ConvLSTM) between the current and adjacent frames to establish dynamic temporal context correlations and effectively update the current frame's features. Finally, extensive experimental results demonstrate the superiority of the proposed HT-VIS on the public YouTube-VIS 2019 dataset and a self-built UAV-VIS segmentation dataset. On four typical subsets (i.e., Zoo, Street, Vehicle, and Sport) extracted from YouTube-VIS 2019 according to category characteristics, the proposed HT-VIS outperforms the state-of-the-art CNN-based VIS method CrossVIS by 3.9%, 2.0%, 0.3%, and 3.8% in average segmentation accuracy, respectively. On the self-built UAV-VIS dataset, our HT-VIS with PHOC surpasses the baseline SipMask by 2.1% and achieves the highest average segmentation accuracy (37.4%) among CNN-based methods, demonstrating the effectiveness and robustness of the proposed framework.
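The abstract does not specify the HOC module at code level. As a rough, illustrative sketch only (not the authors' implementation), the core operation behind deformable offset compensation — warping a feature map by learned per-pixel offsets with bilinear sampling — can be written in NumPy as follows; the function names and the single-channel layout are assumptions for clarity:

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample a 2-D feature map at fractional (ys, xs) coordinates
    using bilinear interpolation."""
    H, W = feat.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    dy, dx = ys - y0, xs - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0]
            + (1 - dy) * dx * feat[y0, x0 + 1]
            + dy * (1 - dx) * feat[y0 + 1, x0]
            + dy * dx * feat[y0 + 1, x0 + 1])

def offset_compensate(feat, offsets):
    """Warp feat (H, W) by per-pixel offsets (2, H, W) = (dy, dx),
    i.e. the basic mechanism of deformable offset compensation
    between adjacent frames."""
    H, W = feat.shape
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ys = np.clip(yy + offsets[0], 0, H - 1).ravel()
    xs = np.clip(xx + offsets[1], 0, W - 1).ravel()
    return bilinear_sample(feat, ys, xs).reshape(H, W)
```

With zero offsets the warp is the identity; integer offsets reduce to a plain shift, and fractional offsets blend the four nearest feature values. In a full model, the offset field would itself be predicted by convolution from paired frame features rather than supplied by hand.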
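The TMU module builds on a standard ConvLSTM cell. The following minimal NumPy sketch (illustrative only, not the paper's implementation; channel sizes, kernel size, and naming are assumptions) shows how such a cell carries a memory state across frames and updates the current frame's feature:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2-D convolution.
    x: (C_in, H, W), w: (C_out, C_in, k, k) -> (C_out, H, W)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1], x.shape[2]
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Minimal ConvLSTM cell: the four gates (input, forget, output,
    candidate) are computed by one convolution over [x_t, h_{t-1}]."""
    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((4 * hid_ch, in_ch + hid_ch, k, k)) * 0.1
        self.hid_ch = hid_ch

    def step(self, x, h, c):
        z = conv2d_same(np.concatenate([x, h], axis=0), self.w)
        i, f, o, g = np.split(z, 4, axis=0)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c_new = f * c + i * g          # temporal memory update
        h_new = o * np.tanh(c_new)     # updated current-frame feature
        return h_new, c_new
```

Iterating `step` over consecutive frame features lets the cell accumulate temporal context in `c` while `h` serves as the temporally refined feature of the current frame — the role the abstract assigns to TMU.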