Construction of an annotated corpus to support biomedical information extraction.

Affiliation

National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK.

Publication information

BMC Bioinformatics. 2009 Oct 23;10:349. doi: 10.1186/1471-2105-10-349.

DOI: 10.1186/1471-2105-10-349
PMID: 19852798
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2774701/

Abstract

BACKGROUND

Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.

RESULTS

We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%.
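
As a rough illustration of what a sentence-bound event annotation of this kind might look like in code, the sketch below models an event as a trigger span plus a set of role-labelled arguments, and scores two annotators with a simple F1-style overlap. Everything in it is an assumption made for illustration: the class names, the example sentence, the role labels (`Agent`, `Theme`, `Condition`, `Location`), the concept labels, and the overlap measure are hypothetical and are not taken from the GREC guidelines or from the evaluation method the abstract refers to.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets within one sentence


@dataclass(frozen=True)
class Argument:
    span: Span
    role: str                      # semantic role label (hypothetical names)
    concept: Optional[str] = None  # ontology concept type (placeholder labels)


@dataclass(frozen=True)
class Event:
    trigger: Span                  # the verb or nominalised verb
    arguments: FrozenSet[Argument]


def find_span(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase and return its offsets."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


def role_agreement(a: Event, b: Event) -> float:
    """F1-style overlap of (span, role) pairs between two annotations.

    This is a simple stand-in measure, not the paper's evaluation method.
    """
    pairs_a = {(arg.span, arg.role) for arg in a.arguments}
    pairs_b = {(arg.span, arg.role) for arg in b.arguments}
    if not pairs_a and not pairs_b:
        return 1.0
    matched = len(pairs_a & pairs_b)
    return 2 * matched / (len(pairs_a) + len(pairs_b))


# Invented example sentence; not drawn from the corpus.
sentence = "NarL activates transcription of narG in anaerobic conditions."
trigger = find_span(sentence, "activates")
agent = Argument(find_span(sentence, "NarL"), "Agent", "Protein")
theme = Argument(find_span(sentence, "transcription of narG"), "Theme", "Process")
cond = find_span(sentence, "anaerobic conditions")

# The two annotators agree on Agent and Theme but label the last span differently.
event_a = Event(trigger, frozenset({agent, theme, Argument(cond, "Condition")}))
event_b = Event(trigger, frozenset({agent, theme, Argument(cond, "Location")}))

score = role_agreement(event_a, event_b)
print(f"agreement: {score:.2f}")
```

Keeping spans and roles as separate fields makes it possible to score different facets of the task independently (for example, span identification alone versus span plus role), which is one way the range of agreement figures reported above could arise.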

CONCLUSION

The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes.
