Le Hoang H, Nguyen Duy M H, Bhatti Omair Shahzad, Kopácsi László, Ngo Thinh P, Nguyen Binh T, Barz Michael, Sonntag Daniel
Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany.
Mathematics and Computer Science Department, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam.
Sci Rep. 2025 Apr 23;15(1):14192. doi: 10.1038/s41598-025-94593-y.
Comprehending how humans process visual information in dynamic settings is crucial for psychology and for designing user-centered interactions. While mobile eye-tracking systems that combine egocentric video and gaze signals can offer valuable insights, manual analysis of these recordings is time-intensive. In this work, we present a novel human-centered learning algorithm designed for automated object recognition in mobile eye-tracking settings. Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations. These mechanisms enable us to learn embedding functions that generalize to new object viewing angles, facilitating rapid adaptation and efficient reasoning in dynamic contexts as users navigate their environment. In experiments on three distinct video sequences, our interactive method shows significant performance improvements over fixed training/testing algorithms, even when trained on considerably fewer annotated samples collected through user feedback. Furthermore, we demonstrate high efficiency in the data annotation process and surpass prior interactive methods that use complete object detectors, combine detectors with convolutional networks, or employ interactive video segmentation.
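To make the architecture concrete, the following is a minimal sketch of one spatial relation-aware inductive message-passing layer over detected objects, in the spirit of the I-MPN described above. It is not the authors' implementation: the class name SpatialMessagePassing, the GraphSAGE-style concat-and-project update, and the 4-dimensional relative-box edge features are all illustrative assumptions. Nodes are object detections (appearance features plus box geometry), and edges carry relative spatial offsets, so aggregation is conditioned on spatial relations; because no per-node embeddings are learned, the layer remains inductive and can be applied to objects and viewpoints unseen during training.

import torch
import torch.nn as nn

class SpatialMessagePassing(nn.Module):
    # Illustrative sketch, not the paper's code: one inductive
    # message-passing layer whose messages are conditioned on the
    # relative spatial geometry of each pair of detected objects.
    def __init__(self, node_dim: int, edge_dim: int = 4, hidden: int = 64):
        super().__init__()
        # Message MLP: conditions each neighbor's feature on the
        # relative box offsets (dx, dy, dw, dh) stored on the edge.
        self.msg = nn.Sequential(
            nn.Linear(node_dim + edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # Update: combine a node's own profile with aggregated messages
        # (concat-then-project keeps the layer inductive, since no
        # per-node parameters are learned).
        self.update = nn.Sequential(
            nn.Linear(node_dim + hidden, node_dim), nn.ReLU(),
        )

    def forward(self, x, edge_index, edge_attr):
        # x: (N, node_dim) node features per detected object
        # edge_index: (2, E) source/destination node indices
        # edge_attr: (E, edge_dim) relative spatial offsets per edge
        src, dst = edge_index
        m = self.msg(torch.cat([x[src], edge_attr], dim=-1))  # (E, hidden)
        agg = torch.zeros(x.size(0), m.size(1), device=x.device)
        agg.index_add_(0, dst, m)  # sum incoming messages per node
        return self.update(torch.cat([x, agg], dim=-1))

# Toy usage: three detected objects, fully connected without self-loops.
if __name__ == "__main__":
    x = torch.randn(3, 32)  # per-object appearance/geometry features
    edge_index = torch.tensor([[0, 0, 1, 1, 2, 2],
                               [1, 2, 0, 2, 0, 1]])
    edge_attr = torch.randn(edge_index.size(1), 4)  # relative box offsets
    layer = SpatialMessagePassing(node_dim=32)
    print(layer(x, edge_index, edge_attr).shape)  # torch.Size([3, 32])

In an interactive setting like the one the abstract describes, a lightweight layer of this kind can be retrained quickly on the small batches of corrections a user provides, which is what makes detector-plus-graph approaches attractive compared with retraining a full object detector.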