Islamaj Doğan Rezarta, Comeau Donald C, Yeganova Lana, Wilbur W John
National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
Database (Oxford). 2014 Jun 9;2014. doi: 10.1093/database/bau044. Print 2014.
BioC is a recently created XML format for sharing text data and annotations, with an accompanying input/output library, designed to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text: the frequent abbreviation of entity names. We selected three abbreviation definition identification modules and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data and with other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora were re-evaluated by four annotators, improving their consistency and quality. We converted the corpora to BioC format and describe the representation of the annotations. These corpora are used to evaluate the three abbreviation-finding algorithms, and the results are reported. Compared with their original forms, the BioC-compatible modules show no difference in efficiency, running time or any other comparable aspect. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.
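To make the idea of BioC-formatted abbreviation annotations concrete, the following is a minimal sketch in Python and is not the BioC input/output library itself. It assumes a small, hand-written BioC-style document in which a short form and its long form are stored as annotations and linked by a relation. The infon value "ABBR", the role names "ShortForm" and "LongForm", the sample text and the helper name abbreviation_pairs are illustrative assumptions; the representation actually used in the released corpora may differ.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in a BioC-like layout (assumed, for illustration only).
SAMPLE = """<collection>
  <source>example</source>
  <date>2014</date>
  <key>example.key</key>
  <document>
    <id>1</id>
    <passage>
      <offset>0</offset>
      <text>Natural language processing (NLP) of biomedical text.</text>
      <annotation id="SF0">
        <infon key="type">ShortForm</infon>
        <location offset="29" length="3"/>
        <text>NLP</text>
      </annotation>
      <annotation id="LF0">
        <infon key="type">LongForm</infon>
        <location offset="0" length="27"/>
        <text>Natural language processing</text>
      </annotation>
      <relation id="R0">
        <infon key="type">ABBR</infon>
        <node refid="SF0" role="ShortForm"/>
        <node refid="LF0" role="LongForm"/>
      </relation>
    </passage>
  </document>
</collection>"""


def abbreviation_pairs(xml_text):
    """Return (short form, long form) pairs found in a BioC-like XML string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for passage in root.iter("passage"):
        text = passage.findtext("text", default="")
        # Annotation offsets are document-level; subtract the passage offset.
        base = int(passage.findtext("offset", default="0"))
        spans = {}
        for ann in passage.findall("annotation"):
            loc = ann.find("location")
            if loc is None:
                continue
            start = int(loc.get("offset")) - base
            end = start + int(loc.get("length"))
            # Recover the span from the passage text rather than trusting <text>.
            spans[ann.get("id")] = text[start:end]
        for rel in passage.findall("relation"):
            roles = {n.get("role"): spans.get(n.get("refid"))
                     for n in rel.findall("node")}
            if roles.get("ShortForm") and roles.get("LongForm"):
                pairs.append((roles["ShortForm"], roles["LongForm"]))
    return pairs


if __name__ == "__main__":
    for short, long_form in abbreviation_pairs(SAMPLE):
        print(short, "=", long_form)
```

Reading spans back from the passage text by offset and length, instead of relying on the annotation's own text element, is one way to check that stored offsets and text agree; a component exchanging data with other modules could use either.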