Cui Mingzhang, Li Caihong, Yang Yi
School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China.
Key Laboratory of Artificial Intelligence and Computing Power Technology, Lanzhou 730000, China.
Sensors (Basel). 2024 Jun 13;24(12):3820. doi: 10.3390/s24123820.
Rapid advances in sensor technologies and deep learning have significantly progressed the field of image captioning, especially for complex scenes. Traditional image captioning methods are often unable to handle the intricacies and detailed relationships within complex scenes. To overcome these limitations, this paper introduces Explicit Image Caption Reasoning (ECR), a novel approach that generates accurate and informative captions for complex scenes captured by advanced sensors. ECR employs an enhanced inference chain to analyze sensor-derived images, examining object relationships and interactions to achieve deeper semantic understanding. We implement ECR using the optimized ICICD dataset, a subset of the sensor-oriented Flickr30K-EE dataset containing comprehensive inference chain information. This dataset enhances training efficiency and caption quality by leveraging rich sensor data. We create the Explicit Image Caption Reasoning Multimodal Model (ECRMM) by fine-tuning TinyLLaVA with the ICICD dataset. Experiments demonstrate ECR's effectiveness and robustness in processing sensor data, outperforming traditional methods.