Uzuner Ozlem, Zhang Xiaoran, Sibanda Tawanda
Information Studies, State University of New York, Albany, NY, USA.
J Am Med Inform Assoc. 2009 Jan-Feb;16(1):109-15. doi: 10.1197/jamia.M2950. Epub 2008 Oct 24.
The authors study two approaches to assertion classification. One of these approaches, Extended NegEx (ENegEx), extends the rule-based NegEx algorithm to cover alter-association assertions; the other, Statistical Assertion Classifier (StAC), presents a machine learning solution to assertion classification.
For each mention of each medical problem, both approaches determine whether the problem, as asserted by the context of that mention, is present, absent, or uncertain in the patient, or associated with someone other than the patient. The authors use these two systems to (1) extend negation and uncertainty extraction to recognition of alter-association assertions, (2) determine the contribution of lexical and syntactic context to assertion classification, and (3) test if a machine learning approach to assertion classification can be as generally applicable and useful as its rule-based counterparts.
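The four-way decision described above can be illustrated with a toy rule-based classifier in the spirit of NegEx. This is a minimal sketch, not the ENegEx rule set: the trigger lists, the `scope` parameter, the rule ordering, and the function name `classify_assertion` are all assumptions made for illustration.

```python
# Hypothetical trigger words; the real NegEx/ENegEx phrase lists are far larger.
ALTER_TRIGGERS = {"mother", "father", "brother", "sister"}
ABSENT_TRIGGERS = {"no", "denies", "without"}
UNCERTAIN_TRIGGERS = {"possible", "probable", "suspected"}


def classify_assertion(tokens, start, scope=5):
    """Label the problem mention that begins at tokens[start].

    Only the `scope` tokens preceding the mention are inspected (an
    assumed simplification). Rules are checked in a fixed, assumed
    order; if none fires, the problem is taken to be present.
    """
    preceding = {t.lower() for t in tokens[max(0, start - scope):start]}
    if preceding & ALTER_TRIGGERS:
        return "alter-association"
    if preceding & ABSENT_TRIGGERS:
        return "absent"
    if preceding & UNCERTAIN_TRIGGERS:
        return "uncertain"
    return "present"
```

For example, under these assumed rules `classify_assertion("The patient denies chest pain".split(), 3)` yields `"absent"`, because "denies" precedes the mention "chest pain".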
The authors evaluated assertion classification approaches with precision, recall, and F-measure.
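As a reminder of how these metrics are computed for a single assertion class, here is a minimal sketch. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state the weighting used, and the function name `per_class_prf` is hypothetical.

```python
def per_class_prf(gold, predicted, label):
    """Return (precision, recall, F1) for one class label.

    `gold` and `predicted` are equal-length sequences of class labels,
    one entry per problem mention.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)

    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```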
The ENegEx algorithm is a general algorithm that can be directly applied to new corpora. Despite being based on machine learning, StAC can also be applied out-of-the-box to new corpora and achieve similar generality.
The StAC models that are developed on discharge summaries can be successfully applied to radiology reports. These models benefit most from the words found in the ±4-word window around the target and can outperform ENegEx.
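The word-window context mentioned above could be extracted along the following lines. This is only a sketch of one plausible feature encoding, not StAC's actual feature set: the position-tagged feature names (`L1=...`, `R1=...`), the lowercasing, and the function name `window_features` are assumptions.

```python
def window_features(tokens, start, end, size=4):
    """Collect words within `size` tokens on either side of a target mention.

    The target occupies tokens[start:end]. Each context word becomes a
    binary feature keyed by its side and distance from the target.
    """
    left = tokens[max(0, start - size):start]
    right = tokens[end:end + size]

    features = {}
    # Walk the left context outward from the target, so L1 is the nearest word.
    for offset, word in enumerate(reversed(left), start=1):
        features[f"L{offset}={word.lower()}"] = 1
    for offset, word in enumerate(right, start=1):
        features[f"R{offset}={word.lower()}"] = 1
    return features
```

A dictionary like this could then be passed to any standard feature vectorizer and classifier; the choice of learner is left open here.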