Semi-automated XML markup of biosystematic legacy literature with the GoldenGATE editor.

Author information

Sautter Guido, Böhm Klemens, Agosti Donat

Affiliation

Department of Computer Science, Universität Karlsruhe (TH), Am Fasanengarten 5, 76128 Karlsruhe, Germany.

Publication information

Pac Symp Biocomput. 2007:391-402.

PMID:17992751

Abstract

Today, digitization of legacy literature is a big issue. This also applies to the domain of biosystematics, where this process has just started. Digitized biosystematics literature requires very precise, fine-grained markup to be useful for detailed search, data linkage, and mining. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built to support the user in this process as far as possible: its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document using GoldenGATE is three to four times faster than with an off-the-shelf XML editor like XML-Spy. With domain-specific NLP-based plug-ins, the speedup is even greater.
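To make the idea of semi-automated markup concrete, the following is a minimal Python sketch of rule-assisted tagging of OCR text. It is not GoldenGATE's API or the authors' method. The element names `paragraph` and `taxonName`, the regular expression, the `known_genera` lookup, and both function names are assumptions chosen purely for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: a capitalized word followed by a lowercase word of
# three or more letters, a rough stand-in for a Latin binomial.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def tag_candidates(text, known_genera, tag="taxonName"):
    """Wrap candidate binomials whose first word is in known_genera in an XML element."""
    pieces = []
    pos = 0
    for match in BINOMIAL.finditer(text):
        if match.group(1) not in known_genera:
            continue  # leave unknown candidates untagged for manual review
        # Escape the plain text before the match so the output stays well-formed.
        pieces.append(escape(text[pos:match.start()]))
        pieces.append(f"<{tag}>{escape(match.group(0))}</{tag}>")
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


def mark_up_paragraph(text, known_genera):
    """Return the OCR text as a single paragraph element with tagged names."""
    return f"<paragraph>{tag_candidates(text, known_genera)}</paragraph>"


if __name__ == "__main__":
    ocr_text = "Workers of Formica rufa & Lasius niger were collected."
    print(mark_up_paragraph(ocr_text, {"Formica", "Lasius"}))
```

In a workflow of this kind, the automatic pass would only propose tags, and a human editor would confirm or correct them, which is the division of labor the abstract describes.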

Similar articles

Font adaptive word indexing of modern printed documents.

IEEE Trans Pattern Anal Mach Intell. 2006 Aug;28(8):1187-99. doi: 10.1109/TPAMI.2006.162.

Integrating and visualizing primary data from prospective and legacy taxonomic literature.

Biodivers Data J. 2015 May 12;(3):e5063. doi: 10.3897/BDJ.3.e5063. eCollection 2015.

CliniViewer: a tool for viewing electronic medical records based on natural language processing and XML.

Stud Health Technol Inform. 2004;107(Pt 1):639-43.

Representing information in patient reports using natural language processing and the extensible markup language.

J Am Med Inform Assoc. 1999 Jan-Feb;6(1):76-87. doi: 10.1136/jamia.1999.0060076.

Value of XML in the implementation of clinical practice guidelines--the issue of content retrieval and presentation.

Med Inform Internet Med. 2001 Apr-Jun;26(2):131-46.

Distributed modules for text annotation and IE applied to the biomedical domain.

Int J Med Inform. 2006 Jun;75(6):496-500. doi: 10.1016/j.ijmedinf.2005.06.011. Epub 2005 Aug 8.

Internet patient records: new techniques.

J Med Internet Res. 2001 Jan-Mar;3(1):E8. doi: 10.2196/jmir.3.1.e8.

Digitising legacy zoological taxonomic literature: Processes, products and using the output.

Zookeys. 2016 Jan 7;(550):189-206. doi: 10.3897/zookeys.550.9702. eCollection 2016.

Strategic reading, ontologies, and the future of scientific publishing.

Science. 2009 Aug 14;325(5942):828-32. doi: 10.1126/science.1157784.

Cited by

Digitising legacy zoological taxonomic literature: Processes, products and using the output.

Zookeys. 2016 Jan 7;(550):189-206. doi: 10.3897/zookeys.550.9702. eCollection 2016.

Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows.

J Biomed Semantics. 2015 Mar 14;6:8. doi: 10.1186/s13326-015-0004-6. eCollection 2015.

Piecing together the biogeographic history of Chenopodium vulvaria L. using botanical literature and collections.

PeerJ. 2015 Jan 8;3:e723. doi: 10.7717/peerj.723. eCollection 2015.

Eupolybothrus cavernicolus Komerički & Stoev sp. n. (Chilopoda: Lithobiomorpha: Lithobiidae): the first eukaryotic species description combining transcriptomic, DNA barcoding and micro-CT imaging data.

Biodivers Data J. 2013 Oct 28;(1):e1013. doi: 10.3897/BDJ.1.e1013. eCollection 2013.

Utilizing descriptive statements from the biodiversity heritage library to expand the Hymenoptera Anatomy Ontology.

PLoS One. 2013;8(2):e55674. doi: 10.1371/journal.pone.0055674. Epub 2013 Feb 18.

Applications of natural language processing in biodiversity science.

Adv Bioinformatics. 2012;2012:391574. doi: 10.1155/2012/391574. Epub 2012 May 22.

Towards the bibliography of life.

Zookeys. 2011;(150):151-66. doi: 10.3897/zookeys.150.2167. Epub 2011 Nov 28.

XML schemas and mark-up practices of taxonomic literature.

Zookeys. 2011;(150):89-116. doi: 10.3897/zookeys.150.2213. Epub 2011 Nov 28.

Semantic annotation of morphological descriptions: an overall strategy.

BMC Bioinformatics. 2010 May 25;11:278. doi: 10.1186/1471-2105-11-278.

LINNAEUS: a species name identification system for biomedical literature.

BMC Bioinformatics. 2010 Feb 11;11:85. doi: 10.1186/1471-2105-11-85.
