National Center for Biotechnology Information, NLM, NIH, Bethesda, MD, USA.
BMC Bioinformatics. 2011 Jun 9;12 Suppl 3(Suppl 3):S6. doi: 10.1186/1471-2105-12-S3-S6.
The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data.
In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data.
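The construction of naturally labeled data described above can be illustrated with a minimal sketch. This is not the authors' implementation: the parenthesis pattern, the `max_words` window, and the function names `extract_candidate_pairs` and `make_negative_pairs` are all assumptions made for illustration. Positive examples are taken as they occur in text; negative examples pair each potential abbreviation with a potential definition drawn from a different pair.

```python
import random
import re

# Assumed pattern (not from the paper): some text followed by a short
# token in parentheses, e.g. "heat shock protein (HSP)".
PAREN = re.compile(r"([^()]+?)\s*\(([A-Za-z0-9-]{2,10})\)")


def extract_candidate_pairs(sentences, max_words=8):
    """Collect naturally occurring (short form, potential definition) pairs.

    The potential definition is simply the last `max_words` words before
    the parenthesis; `max_words` is an illustrative choice.
    """
    pairs = []
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            context_words = match.group(1).split()
            short_form = match.group(2)
            candidate = " ".join(context_words[-max_words:])
            if candidate:
                pairs.append((short_form, candidate))
    return pairs


def make_negative_pairs(positive_pairs, seed=0):
    """Pair each short form with a potential definition from another pair."""
    rng = random.Random(seed)
    originals = set(positive_pairs)
    candidates = [candidate for _, candidate in positive_pairs]
    negatives = []
    for short_form, _ in positive_pairs:
        # Keep only definitions that never co-occurred with this short form.
        unrelated = [c for c in candidates if (short_form, c) not in originals]
        if unrelated:
            negatives.append((short_form, rng.choice(unrelated)))
    return negatives
```

A classifier would then be trained to separate the two sets, and its learned feature weights used to pick the full form within each potential definition. The feature set and learner are not specified in this abstract, so they are left out of the sketch.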
We evaluate our algorithm on the Ab3P, BIOADI and Medstract corpora. It achieves an F-measure of 91.36% on the Ab3P corpus and 87.13% on the BIOADI corpus, both higher than the results reported for the Ab3P and BIOADI systems. It also outperforms these systems in recall, which is one of our goals.