IEEE Trans Cybern. 2024 Feb;54(2):679-692. doi: 10.1109/TCYB.2023.3243999. Epub 2024 Jan 17.
Camera-based passive dietary intake monitoring can continuously capture the eating episodes of a subject, recording rich visual information such as the type and volume of food being consumed, as well as the subject's eating behaviors. However, no existing method incorporates these visual cues to provide a comprehensive context of dietary intake from passive recording (e.g., whether the subject is sharing food with others, what food the subject is eating, and how much food is left in the bowl). Moreover, privacy is a major concern when egocentric wearable cameras are used for capture. In this article, we propose a privacy-preserving solution (i.e., egocentric image captioning) for dietary assessment with passive monitoring, which unifies food recognition, volume estimation, and scene understanding. By converting images into rich text descriptions, nutritionists can assess individual dietary intake from the captions instead of the original images, reducing the risk of privacy leakage. To this end, an egocentric dietary image captioning dataset has been built, consisting of in-the-wild images captured by head-worn and chest-worn cameras during field studies in Ghana. A novel transformer-based architecture is designed to caption egocentric dietary images. Comprehensive experiments have been conducted to evaluate the effectiveness of the proposed architecture and to justify its design for egocentric dietary image captioning. To the best of our knowledge, this is the first work to apply image captioning to dietary intake assessment in real-life settings.
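The abstract names a transformer-based architecture for captioning egocentric dietary images without detailing it. Below is a minimal, generic encoder-decoder captioning sketch in PyTorch for orientation only; it is not the paper's model, and the feature dimension, layer counts, vocabulary size, and the use of pre-extracted image features are all illustrative assumptions.

```python
# Illustrative sketch only: a generic transformer-decoder image-captioning model.
# NOT the architecture proposed in the paper; all hyperparameters are assumptions.
import torch
import torch.nn as nn


class CaptioningSketch(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8,
                 num_layers: int = 3, feat_dim: int = 2048, max_len: int = 40):
        super().__init__()
        # Project pre-extracted image region/patch features to the decoder width.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N, feat_dim) image features; captions: (B, T) token ids.
        memory = self.feat_proj(img_feats)
        T = captions.size(1)
        pos = torch.arange(T, device=captions.device)
        tgt = self.token_emb(captions) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier caption tokens.
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=captions.device), diagonal=1
        )
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # (B, T, vocab_size) next-token logits


if __name__ == "__main__":
    model = CaptioningSketch(vocab_size=10000)
    feats = torch.randn(2, 49, 2048)           # e.g., a 7x7 grid of CNN features
    caps = torch.randint(0, 10000, (2, 20))     # teacher-forced caption tokens
    print(model(feats, caps).shape)             # torch.Size([2, 20, 10000])
```

In this kind of setup, the generated caption (food type, rough quantity, eating context) would be what a nutritionist reviews, so the original image never needs to leave the capture device.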