Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora.

Authors

Islamaj Doğan Rezarta, Comeau Donald C, Yeganova Lana, Wilbur W John

Affiliations

National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

Database (Oxford). 2014 Jun 9;2014. doi: 10.1093/database/bau044. Print 2014.

DOI:10.1093/database/bau044

PMID:24914232

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4051513/

Abstract

BioC is a recently created XML format for sharing text data and annotations, together with an accompanying input/output library, designed to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text: the frequent abbreviation of entity names. We selected three different abbreviation definition identification modules and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data and with other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed® citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators, and their consistency and quality have been improved. We converted the corpora to BioC format and describe the representation of the annotations. These corpora are used to evaluate the three abbreviation-finding algorithms, and the results are given. Compared with their original forms, the BioC-compatible modules show no difference in efficiency, running time or any other comparable aspect. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.
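
To make the idea of a BioC-compatible abbreviation module concrete, the following is a minimal sketch and not the authors' code. It assumes a simplified right-to-left character matching heuristic in the spirit of Schwartz and Hearst, and it writes the result as a BioC-style passage using Python's standard library. The function names (`match_long_form`, `find_pairs`, `to_bioc_passage`) are hypothetical, and the infon values `ShortForm`, `LongForm` and `ABBR` are assumed labels for illustration; the released corpora may use a different representation.

```python
import re
import xml.etree.ElementTree as ET

# Parenthesized candidate short form: 2 to 10 word characters or hyphens, no spaces.
PAREN = re.compile(r"\((\w[\w-]{1,9})\)")


def match_long_form(short, candidate):
    """Match short-form characters right to left inside the candidate text.

    Returns the matched long form (a suffix of candidate) or None.
    """
    l = len(candidate) - 1
    for s in range(len(short) - 1, -1, -1):
        c = short[s].lower()
        if not c.isalnum():
            continue
        # The first short-form character must match at the start of a word.
        while l >= 0 and (
            candidate[l].lower() != c
            or (s == 0 and l > 0 and candidate[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def find_pairs(text):
    """Return (short, short_offset, long, long_offset) tuples found in text."""
    pairs = []
    for m in PAREN.finditer(text):
        short = m.group(1)
        before = text[: m.start()].rstrip()
        # Search window: at most min(|SF| + 5, 2 * |SF|) preceding words.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(before.split(" ")[-max_words:])
        long_form = match_long_form(short, candidate)
        if long_form:
            pairs.append((short, m.start(1), long_form, len(before) - len(long_form)))
    return pairs


def to_bioc_passage(text, pairs):
    """Build a BioC-style <passage> with annotations and relations (assumed labels)."""
    passage = ET.Element("passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for i, (short, s_off, long_form, l_off) in enumerate(pairs):
        for prefix, role, form, off in (
            ("S", "ShortForm", short, s_off),
            ("L", "LongForm", long_form, l_off),
        ):
            ann = ET.SubElement(passage, "annotation", id=f"{prefix}{i}")
            ET.SubElement(ann, "infon", key="type").text = role
            ET.SubElement(ann, "location", offset=str(off), length=str(len(form)))
            ET.SubElement(ann, "text").text = form
    # Relations are added after all annotations so they come last in the passage.
    for i in range(len(pairs)):
        rel = ET.SubElement(passage, "relation", id=f"R{i}")
        ET.SubElement(rel, "infon", key="type").text = "ABBR"
        ET.SubElement(rel, "node", refid=f"S{i}", role="ShortForm")
        ET.SubElement(rel, "node", refid=f"L{i}", role="LongForm")
    return passage


if __name__ == "__main__":
    sample = "We measured heat shock protein (HSP) levels."
    print(ET.tostring(to_bioc_passage(sample, find_pairs(sample)), encoding="unicode"))
```

Because the output is an ordinary passage element with offsets into the passage text, a downstream module that reads the same format could consume these annotations without knowing which abbreviation finder produced them, which is the kind of interoperability the abstract describes.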
