Tang Zhengmi, Miyazaki Tomo, Omachi Shinichiro
IEEE Trans Image Process. 2023;32:5837-5851. doi: 10.1109/TIP.2023.3326685. Epub 2023 Nov 1.
Scene-text image synthesis techniques, which aim to naturally compose text instances on background scene images, are very appealing for training deep neural networks because they provide accurate and comprehensive annotation information. Prior studies have explored generating synthetic text images on two-dimensional and three-dimensional surfaces using rules derived from real-world observations. Some of these studies have proposed generating scene-text images through learning; however, owing to the absence of a suitable training dataset, unsupervised frameworks have been explored to learn from existing real-world data, which might not yield reliable performance. To alleviate this dilemma and facilitate research on learning-based scene text synthesis, we introduce DecompST, a real-world dataset prepared from several public benchmarks, containing three types of annotations: quadrilateral-level bounding boxes, stroke-level text masks, and text-erased images. Leveraging the DecompST dataset, we propose a Learning-Based Text Synthesis engine (LBTS) that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts suitable regions for text embedding, after which TAANet adaptively adjusts the geometry and color of the text instance to match the background context. After training, these networks can be integrated and used to generate synthetic datasets for scene text analysis tasks. Comprehensive experiments were conducted to validate the effectiveness of the proposed LBTS alongside existing methods, and the results indicate that LBTS generates better pretraining data for scene text detectors. Our dataset and code are available at: https://github.com/iiclab/DecompST.
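The two-stage pipeline described above (TLPNet proposes a region, TAANet adapts the text's geometry and color, then the instance is composited with its annotation) can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: the function names, the centered-box proposal, and the brightness-contrast coloring rule are all simplifying assumptions standing in for the learned networks.

```python
import numpy as np

def propose_region(background):
    """Stand-in for TLPNet: return a region suitable for text embedding.
    Here we simply pick a centered box; the real network predicts
    suitability from the scene content."""
    h, w = background.shape[:2]
    return (w // 4, h // 4, w // 2, h // 8)  # (x, y, width, height)

def adapt_appearance(text_mask, background, region):
    """Stand-in for TAANet: fit the stroke mask to the region (geometry)
    and choose a text color that contrasts with the local background
    brightness (a crude proxy for appearance adaptation)."""
    x, y, tw, th = region
    src_h, src_w = text_mask.shape
    # Nearest-neighbor resize of the binary stroke-level text mask.
    ys = np.arange(th) * src_h // th
    xs = np.arange(tw) * src_w // tw
    resized = text_mask[ys][:, xs]
    local_mean = background[y:y + th, x:x + tw].astype(float).mean()
    color = 0 if local_mean > 127 else 255
    return resized, color

def synthesize(background, text_mask):
    """Compose the adapted text instance onto the background and return
    the synthetic image together with its bounding-box annotation."""
    region = propose_region(background)
    x, y, tw, th = region
    resized, color = adapt_appearance(text_mask, background, region)
    out = background.copy()
    out[y:y + th, x:x + tw][resized > 0] = color
    return out, region
```

Because the region and the stroke mask are known at synthesis time, the bounding box and pixel-level mask come for free, which is the annotation advantage the abstract highlights.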