Medical Informatics, Kaiser Permanente Southern California, Pasadena, California, USA.
J Am Med Inform Assoc. 2013 Nov-Dec;20(6):1168-77. doi: 10.1136/amiajnl-2013-001810. Epub 2013 Aug 1.
To develop, evaluate, and share: (1) syntactic parsing guidelines for clinical text, with a new approach to handling ill-formed sentences; and (2) a clinical Treebank annotated according to the guidelines. To document the process and findings for readers with similar interests.
Using random samples from a shared natural language processing challenge dataset, we developed a handbook of domain-customized syntactic parsing guidelines based on iterative annotation and adjudication between two institutions. Special considerations were incorporated into the guidelines for handling ill-formed sentences, which are common in clinical text. Intra- and inter-annotator agreement rates were used to evaluate consistency in following the guidelines. Quantitative and qualitative properties of the annotated Treebank, as well as its use to retrain a statistical parser, were reported.
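The intra- and inter-annotator agreement rates reported here are bracketing (PARSEVAL-style) F-measures over the constituent structures produced for the same sentences. The abstract does not name the scoring tool used; the sketch below is a minimal, simplified computation of such a score using nltk (an assumption), with the two annotations invented purely for illustration.

```python
from nltk import Tree

def labeled_brackets(tree):
    """Collect labeled constituent spans (label, start, end), skipping
    preterminal (POS-tag) brackets as Evalb-style scoring typically does."""
    spans = set()

    def walk(node, start):
        if isinstance(node, str):        # leaf token
            return start + 1
        end = start
        for child in node:
            end = walk(child, end)
        if not (len(node) == 1 and isinstance(node[0], str)):
            spans.add((node.label(), start, end))
        return end

    walk(tree, 0)
    return spans

def bracket_f1(annotation_a, annotation_b):
    """PARSEVAL-style bracketing F-measure between two bracketed
    annotations of the same sentence."""
    a = labeled_brackets(Tree.fromstring(annotation_a))
    b = labeled_brackets(Tree.fromstring(annotation_b))
    matched = len(a & b)
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall) if matched else 0.0

# Two annotators' parses of the same clinical noun phrase (invented example).
ann1 = "(NP (NP (NN chest) (NN pain)) (PP (IN on) (NP (NN exertion))))"
ann2 = "(NP (NN chest) (NN pain) (PP (IN on) (NP (NN exertion))))"
print(round(bracket_f1(ann1, ann2), 3))   # 0.857
```

Agreement over a sentence set would be computed the same way after pooling the matched, gold, and test bracket counts across all sentences.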
A supplement to the Penn Treebank II guidelines was developed for annotating clinical sentences. After three iterations of annotation and adjudication on 450 sentences, the annotators reached an F-measure agreement rate of 0.930 (while the intra-annotator rate was 0.948) on a final independent set. A total of 1100 sentences from progress notes were annotated; these sentences demonstrated domain-specific linguistic features. A statistical parser retrained with combined general English (mainly news text) annotations and our annotations achieved an accuracy of 0.811, higher than models trained with either general or clinical sentences alone. Both the guidelines and syntactic annotations are made available at https://sourceforge.net/projects/medicaltreebank.
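Retraining a constituency parser in this way involves merging the general-English (Penn Treebank news text) annotations with the new clinical annotations into a single training corpus. The abstract does not specify which parser was used or how the corpus is laid out on disk; the sketch below assumes blank-line-separated bracketed trees in .mrg-style files and hypothetical directory names, and only shows how the two treebanks might be concatenated and validated before being handed to a parser's own training routine.

```python
from pathlib import Path
from nltk import Tree

def combine_treebanks(general_dir, clinical_dir, out_file):
    """Concatenate bracketed-tree (.mrg-style) files from a general-English
    treebank and a clinical treebank into one training file, checking that
    every tree is well formed. The combined file would then be passed to
    whichever statistical parser's training command is being used."""
    with open(out_file, "w", encoding="utf-8") as out:
        for source in (general_dir, clinical_dir):
            for path in sorted(Path(source).glob("*.mrg")):
                for block in path.read_text(encoding="utf-8").split("\n\n"):
                    block = block.strip()
                    if not block:
                        continue
                    Tree.fromstring(block)   # raises if the bracketing is malformed
                    out.write(block + "\n\n")

# Hypothetical paths; the actual corpus layout is not given in the abstract.
combine_treebanks("ptb_wsj/", "clinical_treebank/", "combined_train.mrg")
```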
We developed guidelines for parsing clinical text and annotated a corpus accordingly. The high intra- and inter-annotator agreement rates showed good consistency in following the guidelines. The corpus proved useful for retraining a statistical parser, which achieved moderate accuracy.