State Key Laboratory of Software Development Environment, Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education, Beihang University, Beijing, China.
J Am Med Inform Assoc. 2014 Feb;21(e1):e84-92. doi: 10.1136/amiajnl-2013-001806. Epub 2013 Aug 9.
This paper has three aims: (1) to annotate a standard corpus of Chinese discharge summaries; (2) to perform word segmentation and named entity recognition on that corpus; and (3) to build a joint model that performs word segmentation and named entity recognition together.
Two independent systems, one for word segmentation and one for named entity recognition, were built on conditional random field models. In natural language processing, most approaches use a single model to predict outputs, but many studies have shown that combining techniques can improve performance on a range of tasks. We therefore propose a joint model that uses dual decomposition to perform both tasks and to exploit the correlations between them. Three sets of features were designed to compare the proposed joint model with independent models, incremental models, and a joint model trained on combined labels.
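As a rough illustration of the dual decomposition idea named above, the following toy sketch makes two scoring functions agree on a shared binary sequence (for example, word-boundary indicators) by subgradient updates on Lagrange multipliers. It is not the authors' implementation: the brute-force `argmax_seq`, the linear scores `a` and `b`, the step-size schedule, and all function names are assumptions chosen to keep the example self-contained, whereas the systems described here are built on conditional random fields.

```python
from itertools import product


def argmax_seq(score, n):
    """Brute-force argmax over all binary sequences of length n (toy sizes only)."""
    return max(product((0, 1), repeat=n), key=score)


def dual_decompose(f, g, n, iters=50, step=1.0):
    """Seek one sequence maximizing f(s) + g(s) by making two sub-models agree.

    f and g are hypothetical stand-ins for the two task-specific models.
    Returns (sequence, converged_flag).
    """
    u = [0.0] * n  # one Lagrange multiplier per shared variable
    y = None
    for t in range(1, iters + 1):
        # Each sub-model decodes on its own, with its score shifted by u.
        y = argmax_seq(lambda s: f(s) + sum(ui * si for ui, si in zip(u, s)), n)
        z = argmax_seq(lambda s: g(s) - sum(ui * si for ui, si in zip(u, s)), n)
        if y == z:
            # Agreement certifies that y maximizes the joint score f + g.
            return y, True
        rate = step / t  # decaying step size (an assumed schedule)
        # Subgradient step on the dual: penalize positions where the models disagree.
        u = [ui - rate * (yi - zi) for ui, yi, zi in zip(u, y, z)]
    return y, False


# Illustrative linear scores; the values are made up for the example.
a = [2.0, -1.0, 0.5]
b = [-1.0, -1.0, 1.0]
f = lambda s: sum(ai * si for ai, si in zip(a, s))
g = lambda s: sum(bi * si for bi, si in zip(b, s))
joint, converged = dual_decompose(f, g, n=3)
```

When the two decoders agree, the shared sequence is an exact maximizer of the combined score; when they do not agree within the iteration budget, the sketch simply returns the last output of the first model and flags non-convergence.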
Micro-averaged precision (P), recall (R), and F-measure (F) were used to evaluate results.
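A minimal sketch of micro-averaging, assuming each document's gold and predicted annotations are available as sets of hashable items such as `(start, end, type)` spans. The function name, the span representation, and the entity type labels below are illustrative assumptions, not details taken from the study.

```python
def micro_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall, and F-measure.

    Counts are pooled over all documents before the ratios are computed,
    so larger documents contribute proportionally more.
    """
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)  # items that match the gold standard exactly
        n_pred += len(pred)
        n_gold += len(gold)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Hypothetical spans for two documents: (start, end, entity type).
gold_docs = [{(0, 2, "disease"), (3, 5, "drug")}, {(1, 4, "symptom")}]
pred_docs = [{(0, 2, "disease")}, {(1, 4, "symptom"), (5, 6, "drug")}]
p, r, f = micro_prf(gold_docs, pred_docs)
```

The same function applies to segmentation if each item is a word span `(start, end)` instead of a typed entity span.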
The gold standard corpus was created from 336 Chinese discharge summaries containing 71 355 words. The framework using dual decomposition achieved a 0.2% improvement in segmentation and a 1% improvement in recognition, compared with performing each task alone.
The joint model is efficient and effective for both segmentation and recognition compared with performing the two tasks individually. The model achieved encouraging results, demonstrating the feasibility of both tasks.