National Centre for Text Mining, Manchester Interdisciplinary Biocentre, School of Computer Science, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK.
BMC Bioinformatics. 2011 Oct 10;12:393. doi: 10.1186/1471-2105-12-393.
Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. To extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE) systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information. For example, does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event.
We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for five different aspects of meta-knowledge that can be derived from the textual context of the event. The textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme and its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, with Kappa values in the range 0.84-0.93.
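To make the agreement figures above concrete, the following is a minimal sketch of how a Kappa score can be computed for one meta-knowledge dimension. It assumes Cohen's kappa for two annotators, which the text above does not specify. The function name `cohen_kappa` and the label values `"high"` and `"low"` are hypothetical and are not taken from the annotation scheme or the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: two annotators, one categorical label per event.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten events on a single dimension.
annotator_1 = ["high", "high", "high", "high", "low", "low", "high", "high", "low", "high"]
annotator_2 = ["high", "high", "high", "high", "low", "high", "high", "high", "low", "high"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Because kappa corrects for chance agreement, it is lower than raw percentage agreement when one label dominates. In a multi-dimensional scheme, a score of this kind would be computed separately for each dimension.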
By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regard to the diversity of meta-knowledge aspects annotated for each event.