Savkov Aleksandar, Carroll John, Koeling Rob, Cassell Jackie
Department of Informatics, University of Sussex, Brighton, BN1 9QJ UK.
Division of Primary Care and Public Health, Brighton and Sussex Medical School, Brighton, BN1 9PH UK.
Lang Resour Eval. 2016;50:523-548. doi: 10.1007/s10579-015-9330-7. Epub 2016 Jan 11.
The free-text notes that physicians type during patient consultations contain valuable information for the study of disease and treatment. Existing natural language analysis tools struggle to process these notes because they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free-text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
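As a rough illustration of what shallow syntactic chunking produces, the sketch below groups per-token tags into chunks using the widely used IOB (begin/inside/outside) encoding. The abstract does not specify the corpus's label set or encoding, so the `NP`/`VP` labels, the `iob_to_chunks` helper, and the sample note are all hypothetical and are not drawn from the corpus.

```python
from typing import List, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (label, text) chunks.

    Assumes tags of the form "B-X", "I-X", or "O". An "I-X" tag that does not
    continue a chunk with the same label is treated as starting a new chunk.
    """
    chunks: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": token lies outside any chunk
            flush()
            current_label = None
            current_tokens = []
    flush()
    return chunks


# Hypothetical telegraphic note (invented for illustration).
tokens = ["pt", "c/o", "chest", "pain", "2", "days"]
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "B-NP", "I-NP"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

A statistical chunker of the kind the abstract describes would predict the tag sequence; a helper like this one only converts predicted or manually annotated tags into spans.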