Nfor Kintoh Allen, Theodore Armand Tagne Poupi, Ismaylovna Kenesbaeva Periyzat, Joo Moon-Il, Kim Hee-Cheol
Department of Computer Engineering, Inje University, Gimhae 50834, Republic of Korea.
Institute of Digital Anti-Aging Healthcare, Inje University, Gimhae 50834, Republic of Korea.
Nutrients. 2025 Jan 20;17(2):362. doi: 10.3390/nu17020362.
Food image recognition, a crucial step in computational gastronomy, has diverse applications across nutritional platforms. Convolutional neural networks (CNNs) are widely used for this task due to their ability to capture hierarchical features. However, they struggle with long-range dependencies and global feature extraction, which are vital for distinguishing visually similar foods or images where the context of the whole dish is crucial, thus necessitating transformer architectures.
This research combines the capabilities of CNNs and transformers to build a robust classification model that handles both short- and long-range dependencies along with global features, accurately classifying food images and enhancing food image recognition for better nutritional analysis.
Our approach, which combines CNNs and Vision Transformers (ViTs), begins with a ResNet50 backbone model. This model is responsible for local feature extraction from the input image. The resulting feature map is then passed to the ViT encoder block, which handles further global feature extraction and classification using multi-head attention and fully connected layers with pre-trained weights.
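The data flow described above can be sketched as follows. This is a minimal, framework-free illustration of the hybrid pipeline, not the authors' implementation: it assumes a ResNet50-style 2048-channel 7x7 feature map, uses random weights, and shows a single attention head where the paper uses multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Stand-in for the ResNet50 backbone output: a 2048-channel 7x7 feature map.
feat = rng.standard_normal((2048, 7, 7))

# Flatten the spatial grid into 49 tokens of dimension 2048, the form a
# ViT encoder consumes.
tokens = feat.reshape(2048, -1).T            # shape (49, 2048)

# One attention head (illustrative; the model uses multi-head attention).
d = 64
Wq, Wk, Wv = (rng.standard_normal((2048, d)) * 0.01 for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))         # (49, 49) global token-to-token weights
out = attn @ V                               # (49, 64) globally contextualized tokens

# Mean-pool the tokens and classify with a fully connected layer
# (101 classes here is a hypothetical count, e.g. Food-101).
Wc = rng.standard_normal((d, 101)) * 0.01
logits = out.mean(axis=0) @ Wc
print(logits.shape)
```

The attention matrix is dense over all 49 spatial positions, which is what lets the encoder relate distant regions of the dish that a convolution's local receptive field would miss.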
Our experiments on five diverse datasets confirmed superior performance compared to current state-of-the-art methods, and our combined dataset, leveraging complementary features, showed enhanced generalizability and robust performance in addressing global food diversity. We used explainability techniques like Grad-CAM and LIME to understand how the models made their decisions, thereby enhancing the user's trust in the proposed system. This model has been integrated into a mobile application for food recognition and nutrition analysis, offering features like an intelligent diet-tracking system.
This research paves the way for practical applications in personalized nutrition and healthcare, showcasing the extensive potential of AI in nutritional sciences across various dietary platforms.