School of IT, The University of Sydney, Sydney, Australia.
J Am Med Inform Assoc. 2011 Sep-Oct;18(5):574-9. doi: 10.1136/amiajnl-2011-000302. Epub 2011 Jul 7.
Information extraction and classification of clinical data are current challenges in natural language processing. This paper presents a cascaded method that addresses three extraction and classification tasks on clinical data: concept annotation, assertion classification and relation classification.
A pipeline system was developed for clinical natural language processing that includes a proofreading process, with gold-standard reflexive validation and correction. The information extraction system combines a machine learning approach with a rule-based approach. The outputs of this system were evaluated in all three tiers of the fourth i2b2/VA shared-task and workshop challenge.
Overall concept classification attained an F-score of 83.3% against a baseline of 77.0%, the optimal F-score for assertions about the concepts was 92.4%, and the relation classifier attained 72.6% for relationships between clinical concepts against a baseline of 71.0%. Micro-average results for the challenge test set were 81.79%, 91.90% and 70.18%, respectively.
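For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F-score is typically computed: true positives, false positives and false negatives are pooled across classes before precision and recall are taken. The class names and counts are hypothetical illustrations, not figures from this study, and the function names `f_score` and `micro_f_score` are assumptions introduced here rather than part of the challenge's evaluation tooling.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class
Counts = Tuple[int, int, int]


def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f_score(counts_by_class: Dict[str, Counts]) -> float:
    """Pool the counts over all classes, then compute a single F-score."""
    tp = sum(c[0] for c in counts_by_class.values())
    fp = sum(c[1] for c in counts_by_class.values())
    fn = sum(c[2] for c in counts_by_class.values())
    return f_score(tp, fp, fn)


# Hypothetical counts for three concept classes (illustration only).
toy_counts = {
    "problem": (80, 10, 20),
    "treatment": (50, 10, 10),
    "test": (40, 20, 10),
}

if __name__ == "__main__":
    print(f"micro-averaged F-score: {micro_f_score(toy_counts):.4f}")
```

Because the counts are pooled, frequent classes weigh more heavily in a micro-average than in a macro-average, which averages the per-class scores instead.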
The multi-task challenge requires time and workload to be distributed across the individual tasks, so an overall performance evaluation across all three tasks is more informative than treating each task's assessment as independent. The simplicity of the model developed in this work should be contrasted with the very large feature spaces used by other participants in the challenge, who achieved only slightly better performance. When comparing results, a penalty should be charged against the complexity of a model, as defined in message minimalisation theory.
A complete pipeline system is presented for constructing language processing models that can be applied to multiple practical tasks of detecting language structures in clinical records.