Construction of an annotated corpus to support biomedical information extraction.

Affiliation

National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK.

Publication information

BMC Bioinformatics. 2009 Oct 23;10:349. doi: 10.1186/1471-2105-10-349.

Abstract

BACKGROUND

Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.

RESULTS

We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% to 90%.
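
As a rough illustration of the kind of record such a scheme implies, the following Python sketch models one sentence-bound event: a trigger (the verb or nominalised verb) plus arguments that each carry a semantic role and a biological concept type. This is not the GREC file format or the authors' tooling. The class names, the helper `find_span`, the example sentence, and the role and concept labels ("AGENT", "THEME", "Protein", "Operon") are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """Character offsets into a sentence; `end` is exclusive."""
    start: int
    end: int

    def text(self, sentence: str) -> str:
        return sentence[self.start:self.end]


@dataclass(frozen=True)
class Argument:
    """One event participant with its two annotation layers."""
    span: Span
    role: str          # semantic role label (illustrative, not the official set)
    concept_type: str  # biological concept type (illustrative label)


@dataclass
class Event:
    """A sentence-bound event centred on a verb or nominalised verb."""
    sentence: str
    trigger: Span
    arguments: List[Argument] = field(default_factory=list)

    def add_argument(self, span: Span, role: str, concept_type: str) -> None:
        # Sentence-bound: every argument must lie inside the event's sentence.
        if not (0 <= span.start < span.end <= len(self.sentence)):
            raise ValueError("argument must lie within the event's sentence")
        self.arguments.append(Argument(span, role, concept_type))

    def roles(self) -> Dict[str, str]:
        """Map each role label to the text of its argument."""
        return {a.role: a.span.text(self.sentence) for a in self.arguments}


def find_span(sentence: str, phrase: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `phrase`."""
    start = sentence.index(phrase)
    return Span(start, start + len(phrase))


# Invented example sentence, used only to exercise the data structure.
sentence = "The narL gene product activates the nitrate reductase operon"
event = Event(sentence, find_span(sentence, "activates"))
event.add_argument(find_span(sentence, "The narL gene product"), "AGENT", "Protein")
event.add_argument(find_span(sentence, "the nitrate reductase operon"), "THEME", "Operon")
```

Keeping the role and the concept type as separate fields on each argument mirrors the two annotation layers described above, and the bounds check in `add_argument` reflects the restriction that all arguments come from the same sentence as the event.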

CONCLUSION

The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes.
