Lu Shijian, Li Linlin, Tan Chew Lim
Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), 21 Heng Mui Keng Terrace, Singapore.
IEEE Trans Pattern Anal Mach Intell. 2008 Nov;30(11):1913-8. doi: 10.1109/TPAMI.2008.89.
This paper presents a document retrieval technique that searches document images without optical character recognition (OCR). The proposed technique retrieves document images using a new word shape coding scheme, which captures document content by annotating each word image with a word shape code. In particular, we annotate word images using a set of topological shape features: character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved either by query keywords or by a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.
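The idea of retrieving by word shape codes can be illustrated with a minimal text-level sketch. It is not the authors' implementation: the paper extracts features from word images, whereas this sketch assumes a hypothetical lookup table (`ASCENDERS`, `DESCENDERS`, `HOLES`, `UP_RESERVOIR`, `DOWN_RESERVOIR`) that assigns features to lowercase letters, and the code alphabet (`A`, `D`, `O`, `U`, `N`, `x`) is likewise invented for illustration. The function names `word_shape_code`, `build_index`, and `search` are assumptions as well.

```python
from collections import defaultdict

# Hypothetical per-letter feature sets (illustrative assumptions only;
# the paper derives these features from word images, not from text).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")
HOLES = set("abdegopq")          # letters with a closed loop
UP_RESERVOIR = set("uvwy")       # cavity that opens upward
DOWN_RESERVOIR = set("hmn")      # cavity that opens downward


def char_code(ch):
    """Map one lowercase letter to a short feature string."""
    code = ""
    if ch in ASCENDERS:
        code += "A"
    if ch in DESCENDERS:
        code += "D"
    if ch in HOLES:
        code += "O"
    if ch in UP_RESERVOIR:
        code += "U"
    if ch in DOWN_RESERVOIR:
        code += "N"
    return code or "x"           # "x" marks a letter with none of the features


def word_shape_code(word):
    """Concatenate per-letter codes; non-letters are ignored."""
    return "-".join(char_code(c) for c in word.lower() if c.isalpha())


def build_index(docs):
    """Map each word shape code to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            code = word_shape_code(word)
            if code:
                index[code].add(doc_id)
    return index


def search(index, keyword):
    """Return sorted ids of documents containing a word with the keyword's code."""
    return sorted(index.get(word_shape_code(keyword), set()))


if __name__ == "__main__":
    docs = {
        "doc1": "image retrieval without OCR",
        "doc2": "word shape coding",
    }
    index = build_index(docs)
    print(search(index, "shape"))
```

Because the code is coarser than the text itself, distinct words can share a code: in this sketch "dog" and "bog" both map to `AO-O-DO`. That trade-off is inherent to shape-based coding, which gives up some discrimination in exchange for not needing to recognize individual characters.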