Vincze Veronika, Szarvas György, Móra György, Ohta Tomoko, Farkas Richárd
Research Group on Artificial Intelligence, Hungarian Academy of Sciences, Szeged, Hungary.
J Biomed Semantics. 2011 Oct 6;2 Suppl 5(Suppl 5):S8. doi: 10.1186/2041-1480-2-S5-S8.
The treatment of negation and hedging in natural language processing has received much interest recently, especially in the biomedical domain. However, open-access corpora annotated for negation and/or speculation are scarce, and those that exist sometimes follow different design principles. In this paper, we compare the annotation principles of the two largest corpora annotated for negation and speculation - BioScope and Genia Event. BioScope marks linguistic cues and their scopes for negation and hedging, whereas Genia marks biological events for uncertainty and/or negation.
Differences between the annotations of the two corpora are categorized thematically, and the frequency of each category is estimated. We found that the largest number of differences arises because scopes - which cover text spans - are defined with respect to the key events, so every argument of these events (including events within events) falls under the scope as well. In contrast, Genia treats the modality of events within events independently.
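The mismatch described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sentence is invented rather than taken from either corpus, the `Event` class and function names are hypothetical, and the event-level flags are plausible values, not actual Genia annotations. It shows how a span-based scope labels every event inside it, while event-level annotation can mark only the key event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    trigger: str
    start: int        # character offset where the trigger begins
    end: int          # character offset just past the trigger
    speculated: bool  # event-level flag (hypothetical, Genia-style)


def make_event(sentence, trigger, speculated):
    """Locate the trigger in the sentence and build an Event."""
    start = sentence.index(trigger)
    return Event(trigger, start, start + len(trigger), speculated)


def in_scope(event, scope):
    """True if the event trigger lies entirely inside the scope span."""
    scope_start, scope_end = scope
    return scope_start <= event.start and event.end <= scope_end


def scope_based_labels(events, scope):
    """Scope-style reading: every event inside the span is speculated."""
    return {ev.trigger: in_scope(ev, scope) for ev in events}


def disagreements(events, scope):
    """Triggers where the scope-style label differs from the event-level flag."""
    return [ev.trigger for ev in events if in_scope(ev, scope) != ev.speculated]


# Invented example sentence; the hedge cue is "suggest".
sentence = "These results suggest that IL-2 expression requires activation of NF-kappa B."

# Assumed scope: from the cue to the end of the sentence, excluding the period.
scope = (sentence.index("suggest"), len(sentence) - 1)

events = [
    make_event(sentence, "expression", speculated=False),  # event within event
    make_event(sentence, "requires", speculated=True),     # key event
    make_event(sentence, "activation", speculated=False),  # event within event
]

print(scope_based_labels(events, scope))
print(disagreements(events, scope))
```

Under these assumptions the two schemes agree on the key event and differ on both nested events, which is the pattern the abstract reports as the most frequent source of differences.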
The analysis of multiple layers of annotation (linguistic scopes and biological events) showed that detecting negation/hedge keywords and their scopes can contribute to determining the modality of key events (denoted by the main predicate). On the other hand, detecting the negation and speculation status of events within events requires additional syntax-based rules that examine the dependency path between the modality cue and the event cue.
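As a rough illustration of what a dependency-path rule could look like, the sketch below walks from an event trigger up to the modality cue in a toy dependency tree. The rule itself is an assumption for illustration and is not taken from the paper: an event inherits the cue's modality only if no other event trigger lies on the path between them. The function names, the toy parse, and the use of word forms as node identifiers are likewise hypothetical.

```python
def path_to_ancestor(heads, node, ancestor):
    """Return the nodes from `node` up to `ancestor` (inclusive), or None
    if `ancestor` does not dominate `node` in the tree."""
    path = [node]
    while node != ancestor:
        node = heads.get(node)  # the root has no entry, so this yields None
        if node is None:
            return None
        path.append(node)
    return path


def inherits_modality(heads, cue, trigger, triggers):
    """Hypothetical rule: the event inherits the cue's modality only if the
    cue dominates its trigger and no other event trigger intervenes."""
    path = path_to_ancestor(heads, trigger, cue)
    if path is None:
        return False
    intervening = path[1:-1]  # nodes strictly between the trigger and the cue
    return not any(tok in triggers for tok in intervening)


# Toy parse (child -> head) of the invented sentence
# "These results suggest that IL-2 expression requires activation of NF-kappa B."
# Word forms serve as node ids, which assumes each form occurs only once.
heads = {
    "results": "suggest",
    "requires": "suggest",
    "expression": "requires",
    "activation": "requires",
    "IL-2": "expression",
}
triggers = {"requires", "expression", "activation"}

for trig in ("requires", "expression", "activation"):
    print(trig, inherits_modality(heads, "suggest", trig, triggers))
```

With this toy tree, only the key event `requires` is linked directly to the cue; the nested events are reached through another trigger and so are not marked. A real system would work on token indices from a parser and would likely also inspect the dependency relation labels along the path.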