A scale space approach for automatically segmenting words from historical handwritten documents.

Author Information

Manmatha R, Rothfeder Jamie L

Affiliations

Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts, Amherst, 140 Governors Dr., Amherst, MA 01003, USA.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2005 Aug;27(8):1212-25. doi: 10.1109/TPAMI.2005.150.

Abstract

Many libraries, museums, and other organizations contain large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have mostly been developed and tested on highly constrained documents like bank checks and postal addresses. There has been little work on full handwritten pages, and this work has usually involved testing on clean artificial documents created for the purpose of research. Historical manuscript images, on the other hand, contain a great deal of noise and are much more challenging. Here, a novel scale space algorithm for automatically segmenting handwritten (historical) documents into words is described. First, the page is cleaned to remove margins. This is followed by a gray-level projection profile algorithm for finding lines in images. Each line image is then filtered with an anisotropic Laplacian at several scales. This procedure produces blobs which correspond to portions of characters at small scales and to words at larger scales. Crucial to the algorithm is scale selection, that is, finding the optimum scale at which blobs correspond to words. This is done by finding the maximum over scale of the extent or area of the blobs. This scale maximum is estimated using three different approaches. The blobs recovered at the optimum scale are then bounded with a rectangular box to recover the words. A postprocessing filtering step is performed to eliminate boxes of unusual size which are unlikely to correspond to words. The approach is tested on a number of different data sets and it is shown that, on 100 sampled documents from the George Washington corpus of handwritten document images, a total error rate of 17 percent is observed. The technique outperforms a state-of-the-art gap metrics word-segmentation algorithm on this collection.
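A minimal sketch of the pipeline the abstract describes, written in Python with NumPy and SciPy, may help make the steps concrete. Everything here is an illustrative assumption rather than the authors' implementation: scipy.ndimage.gaussian_laplace stands in for the anisotropic Laplacian filter, the projection-profile threshold, the filter aspect ratio, the candidate scale range, and the box-size limits are all invented parameters, and the page is assumed to have already been cleaned of margins and inverted so that ink is bright (values near 1) on a dark background.

```python
# Sketch of scale-space word segmentation: find lines via a projection
# profile, filter each line with an anisotropic LoG at several scales,
# pick the scale with maximal blob extent, and box the blobs.
# All parameter values below are illustrative assumptions.
import numpy as np
from scipy import ndimage


def line_images(page, threshold=0.5):
    """Split a cleaned page (2D float array, ink ~ 1, background ~ 0) into
    line images. Assumption: rows whose ink projection exceeds a fraction
    of the mean projection are treated as belonging to a text line."""
    profile = page.sum(axis=1)
    in_line = profile > threshold * profile.mean()
    labels, _ = ndimage.label(in_line)          # contiguous runs of text rows
    return [page[sl[0]] for sl in ndimage.find_objects(labels)]


def blob_mask(line, scale, aspect=4.0):
    """Anisotropic Laplacian-of-Gaussian response at one scale; the mask keeps
    pixels where the response is negative, i.e. inside bright (ink) blobs.
    Smoothing more along the writing direction merges characters into words."""
    response = ndimage.gaussian_laplace(line, sigma=(scale, aspect * scale))
    return response < 0


def select_scale(line, scales):
    """Pick the scale maximising total blob extent (area), a simple proxy for
    the paper's scale-selection criterion."""
    areas = [blob_mask(line, s).sum() for s in scales]
    return scales[int(np.argmax(areas))]


def word_boxes(line, scales=np.linspace(1.0, 4.0, 7),
               min_area=50, max_area_frac=0.5):
    """Return bounding boxes (row slice, column slice) of word candidates,
    discarding boxes of implausible size in a post-processing step."""
    best = select_scale(line, scales)
    labels, _ = ndimage.label(blob_mask(line, best))
    boxes = []
    for sl in ndimage.find_objects(labels):
        area = (sl[0].stop - sl[0].start) * (sl[1].stop - sl[1].start)
        if min_area <= area <= max_area_frac * line.size:
            boxes.append(sl)
    return boxes
```

For example, `[word_boxes(line) for line in line_images(1.0 - gray_page)]` would produce candidate word boxes for every detected line, where `gray_page` is a page image scaled to [0, 1]. Note that the paper estimates the scale maximum with three different approaches; the argmax over a sampled scale range used here is only the simplest stand-in.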
