School of Information Technology, Faculty of Engineering and IT, University of Sydney, Sydney, Australia.
BMC Bioinformatics. 2009 Sep 24;10:311. doi: 10.1186/1471-2105-10-311.
The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Information Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles.
We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set.
We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.