School of Information Technologies, University of Sydney, 1 Cleveland Street, Sydney 2006, NSW, Australia.
Artif Intell Med. 2015 May;64(1):41-50. doi: 10.1016/j.artmed.2015.03.001. Epub 2015 Mar 24.
To detect negations of medical entities in free-text pathology reports using three different approaches, and to evaluate their performance.
Three different approaches were applied for negation detection: the lexicon-based approach was a rule-based method, relying on trigger terms and termination clues; the syntax-based approach was also a rule-based method, where the rules and negation patterns were designed using the dependency output from the Stanford parser; the machine-learning-based approach used a support vector machine as a classifier to build models with a number of features. A total of 284 English pathology reports of lymphoma were used for the study.
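The lexicon-based idea described above (a trigger term opens a negation scope and a termination clue closes it) can be illustrated with a minimal sketch. The trigger and terminator lists and the function name `negated_entities` are hypothetical and chosen only for illustration; the abstract does not give the study's actual lexicon or rules.

```python
import re

# Hypothetical trigger terms and termination clues (illustrative only,
# not the lexicon used in the study).
TRIGGERS = ("no evidence of", "negative for", "without", "no")
TERMINATORS = ("but", "however", "although", "except")

# Longest triggers first so "no evidence of" wins over the bare "no".
TRIGGER_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True))
    + r")\b"
)
TERMINATOR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TERMINATORS) + r")\b"
)


def negated_entities(sentence, entities):
    """Return the entities that fall inside a negation scope.

    A scope opens right after a trigger term and closes at the next
    termination clue, or at the end of the sentence if there is none.
    """
    text = sentence.lower()

    # Collect (start, end) character spans covered by each trigger.
    scopes = []
    for match in TRIGGER_RE.finditer(text):
        start = match.end()
        stop = TERMINATOR_RE.search(text, start)
        end = stop.start() if stop else len(text)
        scopes.append((start, end))

    # An entity is negated if any of its mentions lies fully inside a scope.
    found = []
    for entity in entities:
        pattern = r"\b" + re.escape(entity.lower()) + r"\b"
        for hit in re.finditer(pattern, text):
            if any(s <= hit.start() and hit.end() <= e for s, e in scopes):
                found.append(entity)
                break
    return found
```

For a sentence such as "No evidence of lymphoma, but reactive hyperplasia is present.", this sketch would mark "lymphoma" as negated and leave "reactive hyperplasia" affirmed, because the clue "but" ends the scope. A production system would need a much larger, corpus-specific lexicon, which is consistent with the abstract's remark that this approach gains most from customization.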
The machine-learning-based approach had the best overall performance on the test set, with a micro-averaged F-score of 82.56%, while the syntax-based approach performed worst, with an F-score of 78.62%. The lexicon-based approach attained an overall average precision of 89.74% and recall of 76.09%, both significantly better than the results achieved by Negation Tagger, which takes a similar approach.
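For readers unfamiliar with the metric, the following sketch shows how a micro-averaged F-score is conventionally computed: true positives, false positives, and false negatives are pooled across groups before precision and recall are taken. The function names and the idea of grouping by report section are assumptions for illustration; the abstract does not state how the counts were grouped.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(counts):
    """Micro-averaged precision, recall, and F-score.

    counts: iterable of (tp, fp, fn) tuples, one per group
    (for example, one per report section; the grouping is an assumption).
    """
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Pool the counts first, then compute the ratios once.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)
```

As a rough check that is not a figure from the paper, `f_score(0.8974, 0.7609)` comes to about 0.8235, which would place the lexicon-based approach between the two reported F-scores if its precision and recall were computed on the same basis.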
The lexicon-based approach benefitted more from being customized to the corpus than the other two methods did. Errors with the syntax-based approach, which performed worst, were mainly due to poor parsing results, while errors with the other two methods were probably caused by abnormal grammatical structures.
A machine-learning-based approach has potential advantages for negation detection and may be preferable for the task. One possible way to improve overall performance is to apply a different approach to each section of the reports.