Pakhomov Serguei V, Coden Anni, Chute Christopher G
Division of Biomedical Informatics, Mayo College of Medicine, Rochester, MN 55905, USA.
Int J Med Inform. 2006 Jun;75(6):418-29. doi: 10.1016/j.ijmedinf.2005.08.006. Epub 2005 Sep 19.
This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.
Three domain experts were trained to perform manual annotation of a corpus of clinical notes. One part of this corpus was combined with the Penn Treebank corpus of general-purpose English text, and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges, as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.
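As an illustration of what an inter-rater agreement check over three annotators can look like, the sketch below computes simple pairwise percent agreement on token-level tags. It is only a minimal example: the abstract does not say which agreement statistic was used, and the function name `pairwise_agreement`, the annotator labels, and the toy tag sequences are all hypothetical.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Return the fraction of identical tags for every pair of annotators.

    `annotations` maps an annotator label to that annotator's list of POS
    tags for the same token sequence (hypothetical input format).
    """
    result = {}
    for a, b in combinations(sorted(annotations), 2):
        tags_a, tags_b = annotations[a], annotations[b]
        if len(tags_a) != len(tags_b) or not tags_a:
            raise ValueError("annotators must tag the same non-empty token sequence")
        matches = sum(x == y for x, y in zip(tags_a, tags_b))
        result[(a, b)] = matches / len(tags_a)
    return result


# Toy data, not taken from the corpus: three annotators tag four tokens.
toy = {
    "A": ["NN", "VBD", "DT", "NN"],
    "B": ["NN", "VBD", "DT", "JJ"],
    "C": ["NN", "VBN", "DT", "NN"],
}

for pair, score in pairwise_agreement(toy).items():
    print(pair, score)
```

Raw percent agreement does not correct for chance agreement, so a chance-corrected statistic may be preferable in practice.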
We used the Trigrams'n'Tags (TnT) tagger [T. Brants, TnT - a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000]. Trained on general English data, it achieved 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between the different types of discourse represented by different sections of clinical text may substantially improve the correctness of POS tagging.
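To put the two figures in perspective, the sketch below shows how token-level correctness is typically computed and what the reported gain amounts to as a relative reduction in tagging errors. The function names are hypothetical and the snippet does not use the TnT tagger; only the two percentages come from the results above.

```python
def tagging_accuracy(gold, predicted):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def relative_error_reduction(baseline_acc, adapted_acc):
    """Fraction of the baseline errors removed by the adapted tagger."""
    baseline_err = 100.0 - baseline_acc
    adapted_err = 100.0 - adapted_acc
    return (baseline_err - adapted_err) / baseline_err


# Reported correctness: general English training vs. added clinical data.
general_acc = 89.79
clinical_acc = 94.69

print(f"absolute gain: {clinical_acc - general_acc:.2f} percentage points")
print(f"relative error reduction: {relative_error_reduction(general_acc, clinical_acc):.1%}")
```

Under this reading, the error rate falls from 10.21% to 5.31%, which is a gain of 4.90 percentage points and removes roughly 48% of the baseline errors.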
Our preliminary experimental results indicate that state-of-the-art POS taggers need to be adapted to the sublanguage domain of clinical text.