Wang Hexu, Luo Wenlong, Wu Wei, Xie Fei, Liu Jindong, Li Jing, Zhang Shizhou
Xi'an Key Laboratory of Human-Machine Integration and Control Technology for Intelligent Rehabilitation, Xijing University, Xi'an 710123, China.
School of Information Science and Technology, Northwest University, Xi'an 710100, China.
Sensors (Basel). 2025 Aug 29;25(17):5362. doi: 10.3390/s25175362.
Unmanned aerial vehicles (UAVs) have become indispensable tools for surveillance, enabled by their ability to capture multi-perspective imagery in dynamic environments. Among critical UAV-based tasks, cross-platform person search, i.e., detecting and identifying individuals across distributed camera networks, presents unique challenges. Severe viewpoint variations, occlusions, and cluttered backgrounds in UAV-captured data degrade the performance of conventional discriminative models, which struggle to maintain robustness under such geometric and semantic disparities. To address this, we propose View-Invariant Person Search (VIPS), a novel two-stage framework combining Faster R-CNN with a view-invariant re-identification (VIReID) module. Unlike conventional discriminative models, VIPS leverages the semantic flexibility of large vision-language models (VLMs) and adopts a two-stage training strategy to decouple and align text-based ID descriptors and visual features, enabling robust cross-view matching through shared semantic embeddings. To mitigate noise from occlusions and cluttered UAV-captured backgrounds, we introduce a learnable mask generator for feature purification. Furthermore, drawing on vision-language models, we design view prompts that explicitly encode perspective shifts into feature representations, enhancing adaptability to UAV-induced viewpoint changes. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, with ablation studies validating the efficacy of each component. Beyond these technical advances, this work highlights the potential of VLM-derived semantic alignment for UAV applications, offering insights for future research in real-time UAV-based surveillance systems.
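The abstract describes the VIReID stage at a high level only. Below is a minimal PyTorch sketch of how a learnable mask generator, view prompts, and shared-embedding matching against text-based ID descriptors could fit together; all module names, dimensions, and the pooling scheme are illustrative assumptions, not the authors' implementation, and a generic feature backbone stands in for the paper's VLM encoder.

    # Minimal sketch of a view-invariant ReID head (assumptions, not the
    # paper's actual architecture): spatial tokens are purified by a
    # learnable soft mask, shifted by a per-view prompt, and projected
    # into a shared embedding space for matching against text ID vectors.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VIReIDHead(nn.Module):
        def __init__(self, feat_dim=512, embed_dim=256, num_views=2):
            super().__init__()
            # Learnable mask generator: scores each spatial token so that
            # occluded or cluttered-background regions can be down-weighted.
            self.mask_gen = nn.Sequential(
                nn.Linear(feat_dim, feat_dim // 4),
                nn.ReLU(inplace=True),
                nn.Linear(feat_dim // 4, 1),
            )
            # View prompts: one learnable vector per camera perspective
            # (e.g., ground vs. aerial) to encode viewpoint shifts.
            self.view_prompts = nn.Embedding(num_views, feat_dim)
            # Projection into the shared semantic space where visual
            # features are compared with text-based ID descriptors.
            self.proj = nn.Linear(feat_dim, embed_dim)

        def forward(self, tokens, view_ids):
            # tokens: (B, N, feat_dim) spatial features from a backbone
            # view_ids: (B,) integer view label per image
            tokens = tokens + self.view_prompts(view_ids)[:, None, :]
            mask = torch.sigmoid(self.mask_gen(tokens))        # (B, N, 1)
            pooled = (mask * tokens).sum(1) / mask.sum(1).clamp(min=1e-6)
            return F.normalize(self.proj(pooled), dim=-1)      # (B, embed_dim)

    # Usage: cosine-similarity matching of purified visual embeddings
    # against (stand-in) text-derived ID embeddings.
    head = VIReIDHead()
    vis = head(torch.randn(4, 49, 512), torch.tensor([0, 1, 0, 1]))
    txt = F.normalize(torch.randn(8, 256), dim=-1)
    scores = vis @ txt.t()                                     # (4, 8)
    print(scores.shape)

In a full two-stage pipeline, Faster R-CNN would first produce person crops, and each crop's backbone features would be fed to a head like this for cross-view matching; the detection stage is omitted here for brevity.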