Klasson Marcus, Zhang Cheng, Kjellström Hedvig
Division of Robotics, Perception, and Learning, Lindstedtsvägen 24, 114 28 Stockholm, Sweden.
Microsoft Research Ltd, 21 Station Road, Cambridge CB1 2FB, UK.
Patterns (N Y). 2020 Nov 13;1(8):100143. doi: 10.1016/j.patter.2020.100143.
An essential task for computer vision-based assistive technologies is to help visually impaired people recognize objects in constrained environments, for instance, recognizing food items in grocery stores. In this paper, we introduce a novel dataset with natural images of groceries (fruits, vegetables, and packaged products), where all images have been taken inside grocery stores to resemble a shopping scenario. Additionally, we download iconic images and text descriptions for each item that can be utilized for better representation learning of groceries. We select a multi-view generative model that can combine the different item information into lower-dimensional representations. The experiments show that utilizing the additional information yields higher accuracy on classifying grocery items than using the natural images alone. We observe that iconic images help to construct representations separated by the visual differences of the items, while text descriptions enable the model to distinguish between visually similar items based on their ingredients.
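The abstract describes a multi-view generative model that fuses per-view information (natural image, iconic image, text description) into a shared lower-dimensional representation. One common way multi-view variational models fuse views is a product of Gaussian experts, where the per-view posterior precisions add. The sketch below illustrates that fusion rule with NumPy on toy posteriors; it is a minimal illustration of the general technique, not the authors' exact model, and the view names and values are hypothetical.

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Fuse per-view Gaussian posteriors (mean, log-variance) into one
    shared Gaussian: precisions add, and the fused mean is the
    precision-weighted average of the per-view means."""
    precisions = [np.exp(-lv) for lv in logvars]
    # Include a standard-normal prior as an extra expert (precision 1).
    total_precision = sum(precisions) + 1.0
    fused_var = 1.0 / total_precision
    fused_mu = fused_var * sum(p * m for p, m in zip(precisions, mus))
    return fused_mu, np.log(fused_var)

# Toy 2D posteriors from three hypothetical view encoders
# (natural image, iconic image, text), all with unit variance.
mu_img,  lv_img  = np.array([0.5, -1.0]), np.zeros(2)
mu_icon, lv_icon = np.array([1.0,  0.0]), np.zeros(2)
mu_text, lv_text = np.array([0.0,  1.0]), np.zeros(2)

mu, lv = product_of_experts([mu_img, mu_icon, mu_text],
                            [lv_img, lv_icon, lv_text])
# Three unit-variance experts plus the prior give total precision 4,
# so the fused variance is 0.25 and the fused mean is the view-mean sum / 4.
```

A view that is missing at test time is simply dropped from the product, which is one reason this fusion rule is popular for multi-view latent-variable models.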