Guido Sautter, Klemens Böhm, Donat Agosti
Department of Computer Science, Universität Karlsruhe (TH), Am Fasanengarten 5, 76128 Karlsruhe, Germany.
Pac Symp Biocomput. 2007:391-402.
Today, the digitization of legacy literature is a major concern. This also holds for biosystematics, where the process has only just begun. To be useful for detailed search, data linkage, and mining, digitized biosystematics literature requires very precise, fine-grained markup. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built to support the user in this process as far as possible: its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document with GoldenGATE is three to four times faster than with an off-the-shelf XML editor such as XML-Spy. With domain-specific NLP-based plug-ins, the speed-up is even greater.