Machine learning with naturally labeled data for identifying abbreviation definitions.

Affiliation

National Center for Biotechnology Information, NLM, NIH, Bethesda, MD, USA.

Publication information

BMC Bioinformatics. 2011 Jun 9;12 Suppl 3(Suppl 3):S6. doi: 10.1186/1471-2105-12-S3-S6.

DOI: 10.1186/1471-2105-12-S3-S6

PMID: 21658293

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3111592/

Abstract

BACKGROUND

The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data.

METHODS

In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data.
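The construction of naturally labeled data described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression `CANDIDATE`, the six-word window, and the function names `positive_examples` and `negative_examples` are all assumptions made for the example. The abstract does not specify how potential abbreviations and potential definitions are detected.

```python
import random
import re

# Assumed heuristic (not from the paper): a short parenthesised token is a
# potential abbreviation, and up to six words before it form the potential
# definition.
CANDIDATE = re.compile(r"((?:[\w-]+ ){1,6})\(([A-Za-z][\w-]{1,9})\)")


def positive_examples(sentences):
    """Collect naturally occurring (abbreviation, potential definition) pairs."""
    pairs = []
    for sentence in sentences:
        for match in CANDIDATE.finditer(sentence):
            pairs.append((match.group(2), match.group(1).strip()))
    return pairs


def negative_examples(positives, seed=0):
    """Pair each abbreviation with a potential definition from another pair."""
    rng = random.Random(seed)
    negatives = []
    for abbr, _ in positives:
        # Only definitions that occurred with a different abbreviation qualify.
        unrelated = [d for a, d in positives if a != abbr]
        if unrelated:
            negatives.append((abbr, rng.choice(unrelated)))
    return negatives


sentences = [
    "We used magnetic resonance imaging (MRI) in all patients.",
    "The polymerase chain reaction (PCR) was repeated.",
]
positives = positive_examples(sentences)
negatives = negative_examples(positives)
```

A classifier trained to separate `positives` from `negatives` would then supply the feature weights that the abstract says are used to pick out the full form. That training step is omitted here because the abstract does not describe the features or the learner.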

RESULTS

We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrates results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on the Ab3P corpus and an F-measure of 87.13% on the BIOADI corpus, both superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
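For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below assumes the balanced form (beta = 1), which the abstract does not state explicitly. The function name `f_measure` and the input values are hypothetical and are not figures from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values, chosen only to show the calculation.
score = f_measure(precision=0.9, recall=0.8)
```

Because the harmonic mean is pulled toward the lower of the two values, a system that gains recall without losing much precision sees its F-measure rise, which is consistent with the emphasis on recall in the results above.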
