Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, 41125 Modena, Italy.
Department of Education and Humanities, University of Modena and Reggio Emilia, 42121 Reggio Emilia, Italy.
Sensors (Basel). 2023 Jan 23;23(3):1286. doi: 10.3390/s23031286.
Research related to the fashion and e-commerce domains is gaining attention in the computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of an external textual memory that can be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and can tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD), which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and of the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
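The retrieval-and-gating mechanism described in the abstract can be illustrated with a minimal sketch: kNN lookup into an external memory of key/value vectors, an attention-weighted read over the retrieved items, and a learned gate that blends the retrieved information with the decoder's hidden state. The function names, the Euclidean kNN metric, and the sigmoid gate parameterization (`gate_w`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knn_retrieve(query, memory_keys, k):
    # Indices of the k memory entries closest to the query (Euclidean distance)
    dists = np.linalg.norm(memory_keys - query, axis=1)
    return np.argsort(dists)[:k]

def gated_memory_read(h, memory_keys, memory_values, k, gate_w):
    """Cross-attention read over kNN-retrieved memory, blended via a gate.

    h:              decoder hidden state, shape (d,)
    memory_keys:    external memory keys, shape (n, d)
    memory_values:  external memory values, shape (n, d)
    gate_w:         gate parameters, shape (2*d,)  -- hypothetical parameterization
    """
    idx = knn_retrieve(h, memory_keys, k)
    keys, values = memory_keys[idx], memory_values[idx]

    # Scaled dot-product cross-attention over the retrieved entries
    att = softmax(keys @ h / np.sqrt(h.size))
    retrieved = att @ values

    # Gate in [0, 1] controlling how much retrieved information flows in
    g = 1.0 / (1.0 + np.exp(-(gate_w @ np.concatenate([h, retrieved]))))
    return g * retrieved + (1.0 - g) * h

# Toy usage with random memory
rng = np.random.default_rng(0)
mem_k = rng.normal(size=(50, 8))
mem_v = rng.normal(size=(50, 8))
h = rng.normal(size=8)
gate_w = rng.normal(size=16)
out = gated_memory_read(h, mem_k, mem_v, k=5, gate_w=gate_w)
```

The gate collapses to a scalar here for simplicity; the paper's "fully attentive" gate is a richer attention-based mechanism, but the blending role it plays is the same.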