The National Centre for Text Mining, School of Computer Science, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK.
BMC Bioinformatics. 2013 Jan 16;14:2. doi: 10.1186/1471-2105-14-2.
Biomedical corpora annotated with event-level information represent an important resource for domain-specific information extraction (IE) systems. However, bio-event annotation alone cannot cater for all the needs of biologists. Unlike work on relation and event extraction, most of which focusses on specific events and named entities, we aim to build a comprehensive resource covering all statements of causal association present in discourse. Causality lies at the heart of biomedical knowledge in areas such as diagnosis, pathology and systems biology. Automatic causality recognition can therefore greatly reduce human workload by suggesting possible causal connections and aiding in the curation of pathway models. A biomedical text corpus annotated with such relations is hence crucial for developing and evaluating biomedical text mining systems.
We have defined an annotation scheme for enriching biomedical domain corpora with causality relations. This scheme has subsequently been used to annotate 851 causal relations to form BioCause, a collection of 19 open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents have been pre-annotated with named entity and event information in the context of previous shared tasks. We report an inter-annotator agreement rate of over 60% for triggers and of over 80% for arguments using an exact match constraint. These rates increase significantly under a relaxed match setting. Moreover, we analyse and describe the causality relations in BioCause from various points of view. This information can then be leveraged for the training of automatic causality detection systems.
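To illustrate the difference between the exact and relaxed match settings mentioned above, the following is a minimal sketch, not the procedure used for BioCause. It assumes that each annotation is a character span given as (start, end) offsets, that "relaxed" means any overlap between two spans, and that agreement is the share of spans from both annotators that have a counterpart in the other annotator's set. The names `Span`, `exact_match`, `relaxed_match` and `agreement`, and the sample offsets, are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def exact_match(a: Span, b: Span) -> bool:
    """Two spans agree only if both boundaries are identical."""
    return a == b


def relaxed_match(a: Span, b: Span) -> bool:
    """Assumed relaxed criterion: the spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def agreement(first: List[Span], second: List[Span],
              match: Callable[[Span, Span], bool]) -> float:
    """Share of all spans that have a matching span from the other annotator."""
    total = len(first) + len(second)
    if total == 0:
        return 1.0  # nothing annotated by either side: trivially in agreement
    matched_first = sum(any(match(a, b) for b in second) for a in first)
    matched_second = sum(any(match(b, a) for a in first) for b in second)
    return (matched_first + matched_second) / total


# Made-up spans for two annotators on the same text.
annotator_a = [(0, 5), (10, 20), (30, 42)]
annotator_b = [(0, 5), (12, 20), (50, 60)]

exact_rate = agreement(annotator_a, annotator_b, exact_match)
relaxed_rate = agreement(annotator_a, annotator_b, relaxed_match)
print(f"exact: {exact_rate:.2f}, relaxed: {relaxed_rate:.2f}")
```

In this made-up example the spans (10, 20) and (12, 20) differ only in their left boundary, so they count as a disagreement under the exact criterion but as an agreement under the relaxed one, which is why the relaxed rate is higher.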
Augmenting named entity and event annotations with information about causal discourse relations could benefit the development of more sophisticated IE systems. Such systems would in turn support a range of tasks, such as textual inference to detect entailments, the discovery of new facts and the generation of new hypotheses for experimental work.