Ho Vo Hoang Duy, Vo Quoc Huy, Hung Bui Thanh
Data Science Laboratory/Data Science Department/Faculty of Information Technology, Industrial University of Ho Chi Minh City, Ho Chi Minh, Vietnam.
PeerJ Comput Sci. 2024 Nov 28;10:e2536. doi: 10.7717/peerj-cs.2536. eCollection 2024.
Extracting information from scanned images is a critical task with far-reaching practical implications. Traditional methods often fall short by inadequately leveraging both image and text features, leading to less accurate and efficient outcomes. In this study, we introduce ConBGAT, a cutting-edge model that seamlessly integrates convolutional neural networks (CNNs), Transformers, and graph attention networks to address these shortcomings. Our approach constructs detailed graphs from text regions within images, using advanced optical character recognition (OCR) to accurately detect and interpret characters. By combining features extracted by CNNs for images with those from Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) for text, our model achieves a comprehensive and efficient data representation. Rigorous testing on real-world datasets shows that ConBGAT significantly outperforms existing methods across multiple evaluation metrics. This advancement not only improves accuracy but also sets a new benchmark for information extraction from scanned images.
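The abstract describes the architecture only at a high level. The sketch below is a minimal, illustrative reading of the fusion idea, not the authors' implementation: each OCR-detected text region becomes a graph node whose feature vector concatenates a CNN embedding of the region's image crop with a DistilBERT embedding of its text, and a graph attention network (here PyTorch Geometric's GATConv) classifies the nodes. The class name ConBGATSketch, the fully connected edge rule, the crop size, and all dimensions are hypothetical choices for illustration.

```python
# Minimal sketch of CNN + DistilBERT feature fusion fed into a GAT.
# All sizes, the edge rule, and the class count are illustrative assumptions;
# this is NOT the paper's implementation.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class ConBGATSketch(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, img_dim=256, txt_dim=768, hidden=128, num_classes=5, heads=4):
        super().__init__()
        # Small CNN over fixed-size crops of each text region (assumed 3x32x32).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, img_dim),
        )
        # Text embeddings are assumed precomputed by DistilBERT (768-dim [CLS]).
        self.gat1 = GATConv(img_dim + txt_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, num_classes, heads=1)

    def forward(self, crops, txt_emb, edge_index):
        # Fuse per-node image and text features by concatenation.
        x = torch.cat([self.cnn(crops), txt_emb], dim=-1)
        x = torch.relu(self.gat1(x, edge_index))
        return self.gat2(x, edge_index)  # per-node class logits

# Toy usage: 6 text regions connected as a fully connected graph.
n = 6
crops = torch.randn(n, 3, 32, 32)   # image crops of the OCR boxes (stand-in)
txt_emb = torch.randn(n, 768)       # DistilBERT [CLS] vectors (stand-in)
edge_index = torch.tensor(
    [[i, j] for i in range(n) for j in range(n) if i != j]
).t().contiguous()
logits = ConBGATSketch()(crops, txt_emb, edge_index)
print(logits.shape)                 # torch.Size([6, 5])
```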