
Convolution-Enhanced Bi-Branch Adaptive Transformer With Cross-Task Interaction for Food Category and Ingredient Recognition.

Publication Information

IEEE Trans Image Process. 2024;33:2572-2586. doi: 10.1109/TIP.2024.3374211. Epub 2024 Apr 1.

Abstract

Recently, visual food analysis has received increasing attention in the computer vision community due to its wide range of application scenarios, e.g., diet and nutrition management, smart restaurants, and personalized diet recommendation. Since food images are unstructured images with complex and unfixed visual patterns, mining food-related semantic-aware regions is crucial. Furthermore, the ingredients contained in food images are semantically related to each other because of cooking habits, and they have significant semantic relationships with food categories under the hierarchical food classification ontology. Therefore, modeling the long-range semantic relationships between ingredients, as well as the category-ingredient semantic interactions, is beneficial for ingredient recognition and food analysis. Taking these factors into consideration, we propose a multi-task learning framework for food category and ingredient recognition. The framework mainly consists of a food-oriented Transformer named Convolution-Enhanced Bi-Branch Adaptive Transformer (CBiAFormer) and a multi-task category-ingredient recognition network called Structural Learning and Cross-Task Interaction (SLCI). To capture the complex and unfixed fine-grained patterns of food images, we propose a query-aware, data-adaptive attention mechanism in CBiAFormer called Bi-Branch Adaptive Attention (BiA-Attention), which consists of a local fine-grained branch and a global coarse-grained branch that mine local and global semantic-aware regions for different input images through an adaptive assignment of candidate key/value sets for each query. Additionally, a convolutional patch embedding module is proposed to extract the fine-grained features that are often neglected by Transformers. To fully utilize the ingredient information, we propose SLCI, which consists of cross-layer attention that models the semantic relationships between ingredients and two cross-task interaction modules that mine the semantic interactions between categories and ingredients. Extensive experiments show that our method achieves competitive performance on three mainstream food datasets (ETH Food-101, Vireo Food-172, and ISIA Food-200). Visualization analyses of CBiAFormer and SLCI on both tasks demonstrate the effectiveness of our method. Code and models are available at https://github.com/Liuyuxinict/CBiAFormer.
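The abstract describes BiA-Attention as two attention branches per query: a local fine-grained branch and a global coarse-grained branch, with candidate key/value sets assigned adaptively to each query. Below is a minimal single-head PyTorch sketch of that general idea; every name and design choice here (the class name, the pooling factor for the coarse branch, the learned per-query gate) is an illustrative assumption, not the authors' implementation, which is available at the GitHub link above. In particular, the paper's adaptive candidate key/value set assignment is simplified to full-resolution attention in the local branch.

```python
# Hypothetical sketch of a query-aware bi-branch attention, loosely following
# the abstract's description of BiA-Attention. Shapes, names, and the gating
# scheme are assumptions for illustration only; see the authors' repository
# at https://github.com/Liuyuxinict/CBiAFormer for the actual method.
import torch
import torch.nn as nn

class BiBranchAdaptiveAttention(nn.Module):
    """Two attention branches per query: a local fine-grained branch over
    full-resolution keys/values, and a global coarse-grained branch over
    spatially pooled keys/values. A learned, query-conditioned gate mixes
    the two branch outputs per token."""
    def __init__(self, dim, pool=4):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.pool = nn.AvgPool2d(pool)   # coarse keys/values for the global branch
        self.gate = nn.Linear(dim, 2)    # per-query mixing weights (assumption)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence laid out on an H x W spatial grid
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Global coarse-grained branch: attend to spatially pooled keys/values.
        k2d = k.transpose(1, 2).reshape(B, C, H, W)
        v2d = v.transpose(1, 2).reshape(B, C, H, W)
        kg = self.pool(k2d).flatten(2).transpose(1, 2)   # (B, Ng, C), Ng << N
        vg = self.pool(v2d).flatten(2).transpose(1, 2)
        attn_g = (q @ kg.transpose(-2, -1)) * self.scale
        out_g = attn_g.softmax(dim=-1) @ vg              # (B, N, C)

        # Local fine-grained branch: simplified here to full-resolution
        # attention; the paper instead restricts each query to an adaptively
        # assigned candidate key/value set, which we do not reproduce.
        attn_l = (q @ k.transpose(-2, -1)) * self.scale
        out_l = attn_l.softmax(dim=-1) @ v               # (B, N, C)

        # Query-conditioned gate decides each branch's contribution per token.
        w = self.gate(q).softmax(dim=-1)                 # (B, N, 2)
        out = w[..., 0:1] * out_l + w[..., 1:2] * out_g
        return self.proj(out)

# Usage on a 14x14 grid of 384-dimensional tokens:
attn = BiBranchAdaptiveAttention(dim=384)
tokens = torch.randn(2, 14 * 14, 384)
out = attn(tokens, H=14, W=14)                           # (2, 196, 384)
```

The per-query gate is one plausible reading of "query-aware" and "data-adaptive": each token decides, based on its own query vector, how much to rely on fine-grained versus coarse-grained context, so different images (and regions) weight the branches differently.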

