Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora.

Author information

Islamaj Doğan Rezarta, Comeau Donald C, Yeganova Lana, Wilbur W John

Affiliations

National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA

Publication information

Database (Oxford). 2014 Jun 9;2014. doi: 10.1093/database/bau044. Print 2014.

Abstract

BioC is a recently created XML format for sharing text data and annotations, together with an accompanying input/output library, designed to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text: the frequent abbreviation of entity names. We selected three different abbreviation definition identification modules and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data and with other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators, and their consistency and quality have been improved. We converted the corpora to BioC format and describe the representation of the annotations. We use these corpora to evaluate the three abbreviation-finding algorithms and report the results. Compared with their original forms, the BioC-compatible modules show no difference in efficiency, running time or any other comparable aspect. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.
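To make the idea of BioC-formatted abbreviation annotations more concrete, the following is a minimal sketch, not the authors' code. It uses only Python's standard `xml.etree.ElementTree` rather than the BioC library itself. The sample fragment, the annotation ids, and the `ShortForm`/`LongForm`/`ABBR` labels are assumptions chosen for illustration; the actual corpora may use different values.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC-style fragment. Ids, infon values and roles are
# illustrative assumptions, not copied from the released corpora.
SAMPLE = """<collection>
  <source>example</source>
  <date>2014</date>
  <key>example.key</key>
  <document>
    <id>1</id>
    <passage>
      <offset>0</offset>
      <text>Natural language processing (NLP) is widely used.</text>
      <annotation id="LF0">
        <infon key="type">LongForm</infon>
        <location offset="0" length="27"/>
        <text>Natural language processing</text>
      </annotation>
      <annotation id="SF0">
        <infon key="type">ShortForm</infon>
        <location offset="29" length="3"/>
        <text>NLP</text>
      </annotation>
      <relation id="R0">
        <infon key="type">ABBR</infon>
        <node refid="LF0" role="LongForm"/>
        <node refid="SF0" role="ShortForm"/>
      </relation>
    </passage>
  </document>
</collection>"""


def span_matches(passage, annotation):
    """Check that an annotation's location points at its own text."""
    base = int(passage.findtext("offset"))
    loc = annotation.find("location")
    start = int(loc.get("offset")) - base
    end = start + int(loc.get("length"))
    return passage.findtext("text")[start:end] == annotation.findtext("text")


def abbreviation_pairs(xml_text):
    """Return (short form, long form) pairs linked by relations."""
    root = ET.fromstring(xml_text)
    pairs = []
    for passage in root.iter("passage"):
        # Keep only annotations whose offsets agree with the passage text.
        text_by_id = {
            a.get("id"): a.findtext("text")
            for a in passage.findall("annotation")
            if span_matches(passage, a)
        }
        for rel in passage.findall("relation"):
            roles = {
                n.get("role"): text_by_id.get(n.get("refid"))
                for n in rel.findall("node")
            }
            if roles.get("ShortForm") and roles.get("LongForm"):
                pairs.append((roles["ShortForm"], roles["LongForm"]))
    return pairs


if __name__ == "__main__":
    print(abbreviation_pairs(SAMPLE))
```

Because the short form, the long form and their link are all stored as stand-off annotations on the passage, a downstream module can read the pairs without re-running the abbreviation finder, which is the kind of interoperability the abstract describes.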
