Artificial Intelligence Laboratory, Hanyang University, Seoul 04763, Korea.
Sensors (Basel). 2022 Apr 30;22(9):3433. doi: 10.3390/s22093433.
With the increase in the performance of deep learning models, the number of model parameters has grown exponentially. An increase in model parameters leads to an increase in computation and training time, i.e., an increase in training cost. To reduce the training cost, we propose Compositional Intelligence (CI), a reuse method that combines models pre-trained on different tasks. Since CI builds on well-trained models, good performance and a small training cost can be expected on the target task. We applied CI to the image captioning task. While a pre-trained feature extractor is commonly used, the caption generator is usually trained from scratch. In contrast, we pre-trained a Transformer model as the caption generator and applied CI, i.e., we used both a pre-trained feature extractor and a pre-trained caption generator. To compare the training cost of the from-scratch model and the CI model, early stopping was applied during fine-tuning on the image captioning task. On the MS-COCO dataset, the vanilla image captioning model reduced training cost by 13.8% and improved performance by up to 3.2%, and the Object Relation Transformer model reduced training cost by 21.3%.
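To make the CI idea concrete, the following is a minimal sketch of composing a pre-trained visual feature extractor with a Transformer caption generator. The specific choices here (a torchvision ResNet-50 backbone, a small nn.TransformerDecoder, the vocabulary size, and the model dimensions) are illustrative assumptions, not the paper's exact architectures; in the CI setting the decoder weights would be loaded from a prior pre-training checkpoint rather than initialized randomly.

```python
# Minimal sketch of a CI-style captioning model (assumed components, not the
# paper's exact setup): a pre-trained ResNet-50 encoder plus a Transformer
# decoder that would be loaded from a pre-trained checkpoint before fine-tuning.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class CaptioningModel(nn.Module):
    """Compose a pre-trained visual encoder with a (pre-trained) caption decoder."""

    def __init__(self, vocab_size: int = 10000, d_model: int = 512):
        super().__init__()
        # Pre-trained feature extractor; keep spatial feature maps (drop avgpool/fc).
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)

        # Caption generator: under CI these weights come from language
        # pre-training (load_state_dict) instead of random initialization.
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)              # (B, 2048, H, W)
        feats = feats.flatten(2).transpose(1, 2)  # (B, H*W, 2048)
        memory = self.proj(feats)                 # (B, H*W, d_model)
        tgt = self.embed(captions)                # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                  # (B, T, vocab_size)
```

During fine-tuning on the captioning data, both the from-scratch and CI variants would be trained with the same early-stopping criterion so that their training costs can be compared fairly.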