Rong Xuejian, Yi Chucai, Tian Yingli
IEEE Trans Image Process. 2019 Jul 26. doi: 10.1109/TIP.2019.2930176.
Text instances provide valuable information for the understanding and interpretation of natural scenes. The rich, precise high-level semantics embodied in text could be beneficial for understanding the world around us and empower a wide range of real-world applications. While most recent visual phrase grounding approaches focus on general objects, this paper explores extracting designated text and predicting an unambiguous scene text segmentation mask, i.e., scene text segmentation from natural language descriptions (referring expressions) such as "orange text on a little boy in black swinging a bat". Solving this novel problem enables accurate segmentation of scene text instances from complex backgrounds. In our proposed framework, a unified deep network jointly models visual and linguistic information by encoding both region-level and pixel-level visual features of natural scene images into spatial feature maps, and then decoding them into a saliency response map of text instances. To conduct quantitative evaluations, we establish a new scene text referring expression segmentation dataset: COCO-CharRef. Experimental results demonstrate the effectiveness of the proposed framework on the text instance segmentation task. By combining image-based visual features with language-based textual explanations, our framework outperforms baselines derived from state-of-the-art text localization and natural language object retrieval methods on the COCO-CharRef dataset.
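The abstract describes the architecture only at a high level: visual features are encoded into spatial feature maps, fused with the linguistic representation of the referring expression, and decoded into a saliency response map. The following minimal PyTorch sketch illustrates one plausible reading of that encode-fuse-decode pattern; it is not the paper's implementation, and the class name TextRefSegNet, the backbone, the layer sizes, and the tiling-based fusion are all hypothetical assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextRefSegNet(nn.Module):
    """Hypothetical sketch: fuse a CNN feature map with an LSTM sentence
    embedding, then decode a per-pixel saliency response map."""

    def __init__(self, vocab_size, embed_dim=300, lang_dim=512, vis_dim=512):
        super().__init__()
        # Pixel-level visual encoder (stand-in for the paper's backbone).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, vis_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Language encoder for the tokenized referring expression.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lang_dim, batch_first=True)
        # Decoder: fused features -> single-channel saliency logits.
        self.decoder = nn.Sequential(
            nn.Conv2d(vis_dim + lang_dim, 256, 1), nn.ReLU(),
            nn.Conv2d(256, 1, 1),
        )

    def forward(self, image, tokens):
        v = self.visual(image)                      # (B, vis_dim, H', W')
        _, (h, _) = self.lstm(self.embed(tokens))   # final hidden state
        # Tile the sentence embedding over every spatial location.
        l = h[-1][:, :, None, None].expand(-1, -1, v.size(2), v.size(3))
        fused = torch.cat([v, l], dim=1)            # joint visual-linguistic map
        logits = self.decoder(fused)                # (B, 1, H', W')
        # Upsample to input resolution; sigmoid gives saliency in [0, 1].
        return torch.sigmoid(nn.functional.interpolate(
            logits, size=image.shape[-2:],
            mode="bilinear", align_corners=False))

# Illustrative usage (shapes only; vocabulary and tokenizer are assumed):
# model = TextRefSegNet(vocab_size=10000)
# image = torch.randn(2, 3, 256, 256)
# tokens = torch.randint(0, 10000, (2, 12))   # e.g. "orange text on a ..."
# mask = model(image, tokens)                 # (2, 1, 256, 256) saliency map
```

Tiling a single sentence embedding across the feature map is only one common fusion choice; the actual framework also incorporates region-level features, which this sketch omits for brevity.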