Lee Kyungjun, Shrivastava Abhinav, Kacorri Hernisa
University of Maryland, College Park.
IEEE Winter Conf Appl Comput Vis. 2020 Mar;2020:3411-3421. doi: 10.1109/wacv45572.2020.9093353. Epub 2020 May 14.
Egocentric vision holds great promise for increasing access to visual information and improving the quality of life for people with visual impairments, with object recognition being one of the daily challenges for this population. While we strive to improve recognition performance, it remains difficult to identify which object is of interest to the user; the object may not even be included in the frame due to challenges in camera aiming without visual feedback. Also, gaze information, commonly used to infer the area of interest in egocentric vision, is often not dependable. However, blind users often tend to include their hand, either interacting with the object that they wish to recognize or simply placing it in proximity for better camera aiming. We propose localization models that leverage the presence of the hand as contextual information for priming the center area of the object of interest. In our approach, hand segmentation is fed to either the entire localization network or its last convolutional layers. Using egocentric datasets from sighted and blind individuals, we show that hand-priming achieves higher precision than other approaches, such as fine-tuning, multi-class, and multi-task learning, which also encode hand-object interactions in localization.
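The sketch below illustrates the hand-priming idea described in the abstract, not the authors' implementation: a hand-segmentation mask is injected either as an extra input channel to the whole localization network ("early" priming) or concatenated with the feature maps feeding the last convolutional layers ("late" priming). All layer sizes, names, and the toy backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HandPrimedLocalizer(nn.Module):
    """Hypothetical localizer primed by a binary hand-segmentation mask."""

    def __init__(self, prime_stage="late"):
        super().__init__()
        self.prime_stage = prime_stage
        in_ch = 4 if prime_stage == "early" else 3      # RGB (+ hand mask if early priming)
        self.backbone = nn.Sequential(                   # stand-in feature extractor
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        head_in = 64 + (1 if prime_stage == "late" else 0)
        self.head = nn.Sequential(                       # last conv layers -> center heatmap
            nn.Conv2d(head_in, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),                         # per-pixel score for the object center
        )

    def forward(self, image, hand_mask):
        if self.prime_stage == "early":
            # Mask enters as a fourth input channel, priming the entire network.
            feats = self.backbone(torch.cat([image, hand_mask], dim=1))
        else:
            # Mask is downsampled and fused only before the last conv layers.
            feats = self.backbone(image)
            mask_small = nn.functional.interpolate(
                hand_mask, size=feats.shape[-2:], mode="nearest")
            feats = torch.cat([feats, mask_small], dim=1)
        return self.head(feats)                          # heatmap priming the object-of-interest center


# Example usage with a 256x256 RGB frame and a binary hand mask.
model = HandPrimedLocalizer(prime_stage="late")
heatmap = model(torch.randn(1, 3, 256, 256), torch.zeros(1, 1, 256, 256))
```

The two `prime_stage` settings mirror the abstract's two fusion points (whole network vs. last convolutional layers); everything else, including the heatmap-style output, is an assumption made to keep the example self-contained.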