Semi-automated XML markup of biosystematic legacy literature with the GoldenGATE editor.

Authors

Sautter Guido, Böhm Klemens, Agosti Donat

Affiliation

Department of Computer Science, Universität Karlsruhe (TH), Am Fasanengarten 5, 76128 Karlsruhe, Germany.

Publication

Pac Symp Biocomput. 2007:391-402.

Abstract

Digitization of legacy literature is a pressing issue today. This also applies to the domain of biosystematics, where the process has only just started. Digitized biosystematics literature requires very precise, fine-grained markup to be useful for detailed search, data linkage, and mining. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built to support the user in this process as far as possible: its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document using GoldenGATE is three to four times faster than with an off-the-shelf XML editor like XML-Spy. With domain-specific NLP-based plug-ins, the speedup is even greater.
