Machine learning with naturally labeled data for identifying abbreviation definitions.

Affiliation

National Center for Biotechnology Information, NLM, NIH, Bethesda, MD, USA.

Publication information

BMC Bioinformatics. 2011 Jun 9;12 Suppl 3(Suppl 3):S6. doi: 10.1186/1471-2105-12-S3-S6.

Abstract

BACKGROUND

The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data.

METHODS

In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data.
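The construction of naturally labeled data described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parenthesis pattern, the 8-token context window, and the function names `extract_candidates`, `make_negatives`, and `build_training_set` are all assumptions made for illustration, and the feature extraction and learner are left out.

```python
import random
import re

# Assumed candidate pattern: a short token in parentheses, e.g. "(PCR)".
SHORT_FORM = re.compile(r"\(([A-Za-z0-9-]{2,10})\)")


def extract_candidates(sentences, window=8):
    """Collect naturally occurring (short form, preceding context) pairs.

    The context is the last `window` tokens before the parenthesis; the
    window size is an illustrative choice, not a value from the paper.
    """
    pairs = []
    for sentence in sentences:
        for match in SHORT_FORM.finditer(sentence):
            tokens = sentence[:match.start()].split()
            if tokens:
                pairs.append((match.group(1), " ".join(tokens[-window:])))
    return pairs


def make_negatives(positives, seed=0):
    """Pair each short form with the context of a different candidate."""
    rng = random.Random(seed)
    order = list(range(len(positives)))
    rng.shuffle(order)
    negatives = []
    for k, i in enumerate(order):
        j = order[(k + 1) % len(order)]  # next item in the shuffled cycle
        short, context = positives[i][0], positives[j][1]
        # Skip self-pairings and repeated abbreviations, which would not
        # be "unrelated" definitions.
        if i != j and positives[j][0] != short:
            negatives.append((short, context))
    return negatives


def build_training_set(sentences, seed=0):
    """Return ((short form, context), label) examples; no manual labels."""
    positives = extract_candidates(sentences)
    negatives = make_negatives(positives, seed)
    return [(p, 1) for p in positives] + [(n, 0) for n in negatives]


sentences = [
    "The polymerase chain reaction (PCR) was used.",
    "We measured body mass index (BMI) in all patients.",
]
training_set = build_training_set(sentences)
```

A classifier trained on such examples sees label 1 for pairs found together in text and label 0 for mismatched pairs, which is what lets the approach avoid manual annotation.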

RESULTS

We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on the Ab3P corpus and an F-measure of 87.13% on the BIOADI corpus, both of which are superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
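For readers unfamiliar with the metric, the sketch below shows how an F-measure is computed from precision and recall. It assumes the balanced form (beta = 1, the harmonic mean of the two); the abstract does not state which variant was used, and the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # defined as 0 when the score would otherwise be 0/0
    return (1 + b2) * precision * recall / denominator
```

Because the harmonic mean is pulled toward the lower of its two inputs, a gain in recall raises the F-measure only if precision does not drop by a comparable amount.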
