
A reappraisal of sentence and token splitting for life sciences documents.

Author information

Tomanek Katrin, Wermter Joachim, Hahn Udo

Affiliation

Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany.

Publication information

Stud Health Technol Inform. 2007;129(Pt 1):524-8.

Abstract

Natural language processing of real-world documents requires several low-level tasks, such as splitting a piece of text into its constituent sentences and splitting each sentence into its constituent tokens, to be performed by a preprocessor prior to linguistic analysis. While this task is often considered unsophisticated clerical work, in the life sciences domain it poses enormous problems due to complex naming conventions. In this paper, we first introduce an annotation framework for sentence and token splitting underlying a newly constructed sentence- and token-tagged biomedical text corpus. This corpus serves as a training environment and test bed for machine-learning-based sentence and token splitters using Conditional Random Fields (CRFs). Our evaluation experiments reveal that CRFs with a rich feature set substantially increase sentence and token detection performance.

