Nursikuwagus Agus, Munir Rinaldi, Khodra Masayu Leylia
Doctoral Program of Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha No.10, Bandung 40132, Indonesia.
Department of Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha No.10, Bandung 40132, Indonesia.
J Imaging. 2022 Oct 22;8(11):294. doi: 10.3390/jimaging8110294.
Captioning is the process of generating a textual description for an image. Previous captioning research has usually focused on foreground objects. In captioning, two main classes of objects are discussed: background objects and foreground objects. In contrast to previous image-captioning research, generating captions for geological images of rocks focuses more on the background of the images. This study proposed an image-captioning model that combines a convolutional neural network (CNN), long short-term memory (LSTM), and word2vec to generate words from an image, with a dense output layer of 256 units. To produce grammatical sentences, the sequence of predicted words was reconstructed by a beam search algorithm with K = 3. The pre-trained baseline model VGG16 and our proposed CNN-A, CNN-B, CNN-C, and CNN-D models were evaluated with N-gram BLEU scores. The BLEU-1 scores for these models were 0.5515, 0.6463, 0.7012, 0.7620, and 0.5620, respectively; the BLEU-2 scores were 0.6048, 0.6507, 0.7083, 0.8756, and 0.6578; the BLEU-3 scores were 0.6414, 0.6892, 0.7312, 0.8861, and 0.7307; and the BLEU-4 scores were 0.6526, 0.6504, 0.7345, 0.8250, and 0.7537. Our CNN-C model outperformed the other models, including the baseline. Several challenges remain for future work, such as geological sentence structure, geological sentence phrases, and constructing words with a geological tagger.
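The cumulative BLEU-N metric used for evaluation (geometric mean of clipped n-gram precisions, multiplied by a brevity penalty) can be sketched in plain Python as below; the candidate and reference sentences are hypothetical illustrations, not captions from the paper's dataset:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Cumulative BLEU-N for a single candidate/reference pair:
    geometric mean of clipped 1..max_n-gram precisions times a
    brevity penalty (zero precisions are smoothed with a tiny floor)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A candidate identical to the reference scores 1.0 at BLEU-1;
# a partial match scores strictly between 0 and 1.
print(bleu("the gray sandstone layer", "the gray sandstone layer", max_n=1))
print(bleu("a gray rock", "the gray sandstone layer", max_n=1))
```

In practice a library implementation such as `nltk.translate.bleu_score.sentence_bleu` would normally be used; this sketch only shows the arithmetic behind the BLEU-1 through BLEU-4 scores reported above.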