Johnson Helen L, Baumgartner William A, Krallinger Martin, Cohen K Bretonnel, Hunter Lawrence
Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO, USA.
J Biomed Discov Collab. 2007 Sep 13;2:4. doi: 10.1186/1747-5333-2-4.
Most biomedical corpora have not been used outside of the lab that created them, even though the availability of the gold-standard evaluation data that they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring - changing the format of a corpus without altering its semantics - is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.
The refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.
We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
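As an illustration of what "changing the format of a corpus without altering its semantics" can mean in practice, the following is a minimal sketch of one such conversion: turning character-offset (standoff) annotations into embedded XML, then reading them back to check that nothing was lost. It is not the procedure used in the paper. The element names `sentence` and `entity`, the `type` attribute, the function names, and the sample sentence are all assumptions made for this example, and the sketch handles only non-overlapping annotations.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def embed_annotations(text, spans):
    """Wrap each (start, end, label) span of `text` in an <entity> element.

    Hypothetical format: the tag and attribute names are illustrative only.
    Spans are character offsets and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"bad or overlapping span: {(start, end, label)}")
        # Escape plain text so characters such as '&' stay well-formed XML.
        pieces.append(escape(text[cursor:start]))
        pieces.append(
            f"<entity type={quoteattr(label)}>{escape(text[start:end])}</entity>"
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "<sentence>" + "".join(pieces) + "</sentence>"


def extract_annotations(xml_string):
    """Recover (text, spans) from the embedded XML produced above."""
    root = ET.fromstring(xml_string)
    text = root.text or ""
    spans = []
    for child in root:
        start = len(text)
        text += child.text or ""
        spans.append((start, len(text), child.get("type")))
        text += child.tail or ""
    return text, spans


if __name__ == "__main__":
    sentence = "p53 binds MDM2 & inhibits it."
    annotations = [(0, 3, "protein"), (10, 14, "protein")]
    xml_form = embed_annotations(sentence, annotations)
    print(xml_form)
    # Round-trip check: the refactored form carries the same text and spans.
    assert extract_annotations(xml_form) == (sentence, annotations)
```

The round-trip assertion is an automated check that the conversion kept the text and annotations intact. It complements the manual validation described above but does not replace it.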